Overview

GPT-NeoX

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger. This repository is under development and may change rapidly without warning.

Requirements

Everything you need to get started running the code can be installed via pip:

$ pip install -r requirements.txt

Important: This codebase does not install Microsoft's DeepSpeed library. It installs DeeperSpeed, EleutherAI's variant of the original DeepSpeed. We have added some functionality necessary for our purposes and patched holes created by the fact that only parts of DeepSpeed were publicly released, but DeeperSpeed uses the same namespace as DeepSpeed and may break other code built upon DeepSpeed. If you use or suspect you might use Microsoft's DeepSpeed for another project, we strongly recommend you use anaconda to install this code in an isolated environment by creating a conda environment and running conda install --file requirements.txt. We welcome any suggestions for improvements to our DeeperSpeed library, but please open issues on its repo rather than this one.
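
For example, an isolated setup might look like the following (the environment name and Python version are illustrative, not prescribed by this repository):

$ conda create -n gpt-neox python=3.8
$ conda activate gpt-neox
$ conda install --file requirements.txt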

EleutherAI members who wish to run models on our Kubernetes cluster will additionally need to install Kubernetes and obtain authorization from Stella Biderman or Sid Black. Please reach out on Discord in the #gpt-neo channel. You will also need to create a WandB account and share your username so that you can be added to the organization's WandB account.

Running the code

The core anatomy of a call to the DeepSpeed engine is the following

$ deepspeed --hostfile=host_path train_script.py user_args \
	--deepspeed \
	--deepspeed_config deepspeed_config.json

where

  • host_path (optional) is the path to the host file containing the addresses of the machines you wish to train on.
  • train_script.py is the training script you wish to use. Our main training script is train_pipeline.py.
  • deepspeed_config.json is the json file containing DeepSpeed-specific hyperparameters.
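
Putting these together, a minimal single-machine call (optional hostfile omitted, no extra user arguments) might look like:

$ deepspeed train_pipeline.py \
	--deepspeed \
	--deepspeed_config deepspeed_config.json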

In this repository, we provide a lightweight wrapper for the above function call for two main reasons. Firstly, we find the way the arguments are ordered and used somewhat counterintuitive; secondly, our wrapper automatically uploads logging data to WandB. Everything in this repository will work with both the native DeepSpeed command and with our deepy command. The core anatomy of a deepy call is

$ ./deepy --hostfile=host_path train_script.py deepspeed_config.json
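
For example, using our main training script and the default hostfile location described below:

$ ./deepy --hostfile=~/jobs/hostfile train_pipeline.py deepspeed_config.json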

Running the code locally

This code is set up to run automatically on as many GPUs as are available. If you have multiple GPUs and only wish to make use of some of them, you can find information about how to specify which GPU(s) to use in training here.
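
For instance, with the standard DeepSpeed launcher you can usually restrict a local run to particular devices with the --include flag (the device indices here are illustrative):

$ deepspeed --include localhost:0,1 train_pipeline.py \
	--deepspeed \
	--deepspeed_config deepspeed_config.json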

The most common pitfall for local training is pipeline parallelism. Pipeline parallelism partitions the model into segments (called PipelineModules in our code) that can decrease latency by running partially asynchronously.

Running the code on a server

This code is set up to run automatically on as many GPUs as are available. To run across multiple machines, you need to make use of a hostfile which lists the IP address of each machine you wish to run the code on followed by the number of GPUs to use. For example, 123.45.67.890 slots=8 instructs the code to run on all eight GPUs of the machine at 123.45.67.890. Each machine should be listed on a separate line with no end-of-line punctuation. It is officially recommended that you set up passwordless ssh, but we have had success entering the password at run-time. To have your hostfile used by GPT-NeoX automatically, store it at ~/jobs/hostfile. Otherwise, you can provide it as an argument as shown above.
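
For example, a hostfile for two machines with eight GPUs each might look like this (the second address is purely illustrative):

123.45.67.890 slots=8
123.45.67.891 slots=8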

EleutherAI members: Once you have been granted access to the EleutherAI servers and have confirmed that an unused cluster is currently running, simply ssh into the cluster. If you have been granted the ability to create and destroy Kubernetes clusters, run kubernetes/deploy_k8s.sh branch_name num_pods cluster_name to create a cluster.

~/scripts/

The directory ~/scripts/ stores various scripts for automatically starting runs with particular settings and configs that we have found useful. They can be run using sh scripts/script_name.sh but should not be relied upon. We do not guarantee forward compatibility of any scripts.

Datasets

Tokenizers

Using our data

Using your data

Advanced Options

Contribute

If you want to get involved, check out our repo projects. Anything that is listed as "todo" or has not been assigned to anyone is fair game, but please leave a comment so that we know you're working on it!

Resources

If you have trouble getting the model to run, consider consulting this guide to installing in a GCE virtual machine. You may also find the (very sparse) DeepSpeed docs helpful.

Comments
  • Running on a single GPU

    I tried merging the checkpoints as described for a single GPU: python tools/merge20b.py --input_dir ./20B_checkpoints --output_dir ./20B_checkpoints_merged

    However, I'm getting this error when generating: RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50432, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

    How can I adjust the current model to match size 50432? Or is it the other way around?

    bug 
    opened by huey2531 22
  • Clean up Neox configuration

    Clean up the NeoX configuration so config files can be used instead of a mishmash of files, command-line args, and environment variables.

    Aim:

    • All parameters can be set using passed json files
    • No parameters are repeated
    • Modify megatron's codebase as little as possible to make it easier to merge upstream megatron changes in the future.

    Nice to haves:

    Todo:

    • [x] Convert all examples and configs to new configuration
    • [ ] Create config documentation with all possible parameters
    • [x] Separate configs into: model and system
    • [x] Cast numbers to numbers in JSON (suggested by @StellaAthena)
    • [x] Calculate batch size from other params (micro_batch_per_gpu*GAS*n_gpus; see the sketch below)
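
    As a quick illustration of that last relation (variable names here are generic, not the actual NeoX config keys):

    # effective train batch size from the per-GPU micro batch size,
    # gradient accumulation steps (GAS) and the number of data-parallel GPUs
    micro_batch_per_gpu = 4
    gradient_accumulation_steps = 32
    n_gpus = 8

    train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * n_gpus
    print(train_batch_size)  # 1024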
    opened by joshlk 19
  • Model and config code of an HF gpt-neox model; a conversion script.

    The modeling and configuration files are largely based on HF's gpt-j model. (I found gpt-j's architecture more similar to gpt-neox than gpt-neo's, especially in its use of rotary embeddings.)

    Modifications to the original gpt-j modeling:

    • Added post-attention layernorm as ln_2.
    • Changed the q_proj, k_proj, v_proj linear layers to a single qkv_proj that corresponds to gpt-neox's attention.query_key_value linear layer, and set bias=True (see the sketch after this list).
    • Combined gpt-neox's and HF gpt-j's rotary embedding functions.
    • Set bias=False for lm_head.
    • Updated the computation in GPTNeoXBlock in correspondence to two residual computing ways in gpt-neox, which is controlled by a new config argument gpt_j_residual.
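
    A minimal PyTorch sketch of such a fused projection (illustrative only, not the actual module from this PR; head interleaving details are omitted):

    import torch
    import torch.nn as nn

    class FusedQKV(nn.Module):
        """One qkv_proj linear layer in place of separate q_proj / k_proj / v_proj."""

        def __init__(self, hidden_size: int):
            super().__init__()
            # bias=True to mirror gpt-neox's attention.query_key_value layer
            self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=True)

        def forward(self, hidden_states: torch.Tensor):
            # hidden_states: [batch, seq_len, hidden_size]
            qkv = self.qkv_proj(hidden_states)   # [batch, seq_len, 3 * hidden_size]
            q, k, v = qkv.chunk(3, dim=-1)       # each [batch, seq_len, hidden_size]
            return q, k, v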

    Modifications to the original gpt-j configuration:

    • Set the default value of activation_function to gelu.
    • Removed rotary_dim (so that its default value is None).
    • Added a gpt_j_residual argument (default value is False) in correspondence to two residual computing ways in gpt-neox.

    A conversion script:

    • that reads config files and gpt-neox's output state dict files and outputs a pretrained HF pytorch GPTNeoX model.
    • Note that weights are not loaded for two kinds of model parameters transformer.h.*.attn.bias and transformer.h.*.attn.masked_bias because they should keep their default values.

    Things that I have checked with a 1B model, which is trained basically following the default XL.yml config:

    • The above conversion script works correctly.
    • The greedy-decoding outputs by gpt-neox's inference script and HF's generate() are identical (see the sketch after this list).
    • Intermediate outputs (e.g. hidden states) are almost identical when running from gpt-neox code and HF code. There are some small differences, which I think are caused by precision settings.
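
    A hedged sketch of how the greedy-decoding comparison can be run on the HF side (the converted-model path is hypothetical):

    import torch
    from transformers import AutoTokenizer, GPTNeoXForCausalLM  # modeling class introduced by this PR

    tokenizer = AutoTokenizer.from_pretrained("./hf_converted_1B")
    model = GPTNeoXForCausalLM.from_pretrained("./hf_converted_1B").eval()

    inputs = tokenizer("Once upon a time", return_tensors="pt")
    with torch.no_grad():
        # do_sample=False gives greedy decoding, matching the no-sampling setting used in gpt-neox
        output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    print(tokenizer.decode(output[0]))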

    Things that are not included in this pull request:

    • Tensorflow-related code.
    • Conversion script that considers checkpoints trained with model parallel.
    • Other model variants that might use a different type of, for example, rotary embeddings.
    • Things that haven't come to my mind.

    BTW, I haven't found a better place to put the new files so I simply created a directory huggingface under /tools.

    A table that summarizes parameters (and their shapes) in 1) a GPT-NeoX checkpoint, 2) an HF GPTNeo model, 3) an HF GPTJ model, and 4) the HF GPTNeoX model in this pull request: [screenshot of the parameter table omitted]

    opened by ZHAOTING 16
  • align gpt-j layernorm to hf

    Looking deeper into the gpt-j residual implementation, I found a delta in the way layernorm(s) are applied. I don't see the point in applying two separate layernorm modules to the hidden_states (x).

    Compare the HF implementation. https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279

    Is there a reason for having two layernorms? Am I completely off?

    opened by sweinbach 15
  • 13B Model Out of Memory with Single Node 8 A100 GPUs

    Hi!

    Thanks for your contribution in making this repo available :)

    I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B. What is required for this? I also tried DeepSpeed stage 3 with offload, without pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!

    opened by benathi 14
  • Add support for Flash attention

    This PR adds Tri Dao's Flash Attention as an optional backend for the global attention operation, enabled by setting the attention_config to [[["flash"], ...]]. I've tested the changes in my own environment and consistently see a 2x boost for 4K sequence lengths in models ranging from 100M to 3B parameters.
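
    As a sketch, enabling it in a model config might look like the following (the repeat count of 12 layers is illustrative; check the exact repeat-count convention against your own config):

    # hypothetical excerpt from a model .yml config
    "attention_config": [[["flash"], 12]],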

    Maybe relevant: @tridao @lucidrains

    opened by VHellendoorn 13
  • Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together

    Pipeline parallelism and gradient checkpointing both work when you use them individually. However, when you turn them both on, you get a mysterious KeyError: 0 from somewhere deep in DeepSpeed.

    bug 
    opened by StellaAthena 12
  • distributed training with multiple nodes.

    Hi, I want to train a 13B model on 4 nodes, each with 8 A100 GPUs, but I don't know how to run the code on my cluster. Can you show me an example? I have only run it successfully on a single node.

    bug 
    opened by cdj0311 11
  • fix alibi inference shapes for cached layer_past

    Restart of the now-reverted previous fix.

    History:

    • Old PR was merged, after which some (small?) differences in model output became apparent in discussion with @sdtblck
    • Old PR was reverted
    • This PR is opened to discuss the issue

    Tests and validations so far:

    Inference was tested on a trained neox checkpoint (model size ~4B, 60k steps trained). Random sampling is deactivated (no top_k, top_p, temperature).

    1. test NOT using recompute, i.e. with cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy-pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the model is working => generated text seems far from random (though not a fairytale :-) )

    2. test using recompute, i.e. with NO cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy-pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => The generated text is exactly the same as above

    Tests and validations that resulted in discussions.

    The following line has been added to text_generation_utils.py to print logits: [screenshot of the added line omitted]

    1. test NOT using recompute, logit output

    (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(34.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.7188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.4062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46., device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(30480, 
device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.8125, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the generated text is the same as above

    1. test using recompute, logit output (copy pasted from terminal) Context prompt >>> Once upon a time generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.9062, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(35.0312, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(40.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.3125, device='cuda:0', dtype=torch.float16) generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.2812, device='cuda:0', dtype=torch.float16) generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.6875, device='cuda:0', dtype=torch.float16) generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1250, device='cuda:0', dtype=torch.float16) generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.4688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.3750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.3438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.2500, device='cuda:0', dtype=torch.float16) generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.2188, device='cuda:0', dtype=torch.float16) generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46.0625, device='cuda:0', dtype=torch.float16) generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.8750, device='cuda:0', dtype=torch.float16) generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6562, device='cuda:0', dtype=torch.float16) generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16) generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', 
dtype=torch.float16) generated_tokens tensor(30480, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16) generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', dtype=torch.float16) Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => Generated text still the same => Logits are different. This is the question at hand

    Questions

    • Why are logits different?
    • Does the difference only occur for alibi? Is this a general issue, if it is an issue at all?
    opened by sweinbach 11
  • Running through Dockerfile broken

    Describe the bug When using an image based on the provided Dockerfile and running the quick start steps (download the Enron data, run deepy.py), execution crashes before training begins.

    To Reproduce Steps to reproduce the behavior:

    1. Build an image using the provided Dockerfile
    2. Run said image, mounting 8 RTX8000 GPUs
    3. Fetch enron data using the prepare_dataset.py script
    4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
    5. The code crashes with a nondescript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

    Expected behavior Training starts or a specific error is provided.

    Proposed solution The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.
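
    One general NCCL debugging step (not specific to this repo) that can surface more detail is raising NCCL's log level before launching, e.g.:

    $ NCCL_DEBUG=INFO ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml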

    Environment (please complete the following information):

    • GPUs: 8 RTX8000 GPUs
    • Configs: Ubuntu 20.04, Cuda 11.2

    Additional context Add any other context about the problem here.

    bug 
    opened by VHellendoorn 11
  • Create experiment runners

    We will want to run experiments with a variety of configs and options. To enable this, we need two things:

    • [ ] configs files that we can use to specify the settings for a particular run
    • [ ] an experiment runner for managing and automatically executing several runs
    feature request good first issue 
    opened by StellaAthena 11
  • In interactive mode, a prompt longer than one word causes a crash

    Describe the bug In interactive mode, a prompt longer than one word causes a crash. When I type just one word, it generates text though.

    text_generation.yml

    {
        "text-gen-type": "interactive",
        "maximum_tokens": 500,
        "temperature": 0.9,
        "top_p": 0,
        "top_k": 0,
        "recompute": false,
        "num-samples": 10,
        "sample-input-file": "prompt.txt",
        "sample-output-file": "sample_output.txt",
    }

    `Context prompt >>> Hello from Traceback (most recent call last): Traceback (most recent call last): File "generate.py", line 89, in File "generate.py", line 89, in main() File "generate.py", line 72, in main main() File "generate.py", line 72, in main generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy

    IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] Killing subprocess 7479 Killing subprocess 7480 Killing subprocess 7481 Killing subprocess 7482 Traceback (most recent call last): File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in main() File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2}', '--megatron_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2, "precision": "fp16", "num_layers": 44, "hidden_size": 6144, "num_attention_heads": 64, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "rotary_pct": 0.25, "init_method": "small_init", "output_layer_init_method": "wang_init", "gpt_j_residual": true, "gpt_j_tied": true, "output_layer_parallelism": "column", "lr_decay_style": "cosine", "lr_decay_iters": 150000, "min_lr": 9.7e-06, "optimizer_type": "Adam", "zero_stage": 1, 
"zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 1260000000, "zero_allgather_bucket_size": 1260000000, "lr": 9.7e-05, "tokenizer_type": "HFTokenizer", "data_path": "./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document", "data_impl": "mmap", "save": "./20B_checkpoints", "config_files": {"20B.yml": "# DISCLAIMER: This is the configuration file for the GPT-NeoX-20B model as it was trained on 96x 40GB A100\n# GPUs. Depending on your system configuration, you may need to change some parameters in order to fit\n# the model in memory.\n\n{\n # Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in\n \"vocab-file\": \"./20B_checkpoints/20B_tokenizer.json\",\n \"save\": \"./20B_checkpoints\",\n \"load\": \"./20B_checkpoints\",\n\n # If finetuning, edit the following to the location of your finetuning dataset:\n \"data-path\": \"./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document\",\n\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 2,\n \"model-parallel-size\": 2,\n\n # model settings\n \"num-layers\": 44,\n \"hidden-size\": 6144,\n \"num-attention-heads\": 64,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"rotary_pct\": 0.25,\n \"no-weight-tying\": true,\n \"gpt_j_residual\": true,\n \"gpt_j_tied\": true,\n \"output_layer_parallelism\": \"column\",\n \"scaled-upper-triang-masked-softmax-fusion\": true,\n \"bias-gelu-fusion\": true,\n\n # init methods\n \"init_method\": \"small_init\",\n \"output_layer_init_method\": \"wang_init\",\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.97e-4,\n \"betas\": [0.9, 0.95],\n \"eps\": 1.0e-8,\n }\n },\n\n \"min_lr\": 0.97e-5,\n\n # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n \"zero_optimization\": {\n \"stage\": 1,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 1260000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 1260000000,\n \"contiguous_gradients\": True,\n },\n\n # batch / data settings (assuming 96 GPUs)\n \"train_micro_batch_size_per_gpu\": 4,\n \"gradient_accumulation_steps\": 32,\n \"data-impl\": \"mmap\",\n \"split\": \"995,4,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": false,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.01,\n \"hidden-dropout\": 0,\n \"attention-dropout\": 0,\n\n # precision settings\n \"fp16\": {\n \"fp16\": true,\n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_scale_power\": 12,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. 
training settings\n \"train-iters\": 150000,\n \"lr-decay-iters\": 150000,\n\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"checkpoint-factor\": 500,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 2,\n \"steps_per_print\": 2,\n \"wall_clock_breakdown\": false,\n\n ### NEW DATA: ####\n \"tokenizer_type\": \"HFTokenizer\",\n \"tensorboard-dir\": \"./tensorboard\",\n \"log-dir\": \"./logs\",\n\n}\n", "text_generation_interactive.yml": "# Parameters used for text generation\n# Make sure load is specified somewhere else\n{\n # Text gen type: input-file, unconditional or interactive\n \"text-gen-type\": \"interactive\",\n\n # Params for all\n \"maximum_tokens\": 500,\n \"temperature\": 0.9,\n \"top_p\": 0,\n \"top_k\": 0,\n \"recompute\": false,\n\n # unconditional: samples\n \"num-samples\": 10,\n\n # input/output file\n \"sample-input-file\": \"prompt.txt\",\n \"sample-output-file\": \"sample_output.txt\",\n}\n"}, "load": "./20B_checkpoints", "checkpoint_factor": 500, "batch_size": 4, "train_iters": 150000, "eval_iters": 10, "split": "995,4,1", "vocab_file": "./20B_checkpoints/20B_tokenizer.json", "attention_dropout": 0, "hidden_dropout": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "gas": 32, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "model_parallel_size": 2, "is_pipe_parallel": true, "wandb_group": "72fp5jTbC3iYzFUHnE9Fh2_35wkyadj", "log_dir": "./logs", "tensorboard_dir": "./tensorboard", "log_interval": 2, "text_gen_type": "interactive", "temperature": 0.9, "maximum_tokens": 500, "sample_input_file": "prompt.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "save_iters": [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 10500, 11000, 11500, 12000, 12500, 13000, 13500, 14000, 14500, 15000, 15500, 16000, 16500, 17000, 17500, 18000, 18500, 19000, 19500, 20000, 20500, 21000, 21500, 22000, 22500, 23000, 23500, 24000, 24500, 25000, 25500, 26000, 26500, 27000, 27500, 28000, 28500, 29000, 29500, 30000, 30500, 31000, 31500, 32000, 32500, 33000, 33500, 34000, 34500, 35000, 35500, 36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500, 43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500, 50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500, 57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500, 64000, 64500, 65000, 65500, 66000, 66500, 67000, 67500, 68000, 68500, 69000, 69500, 70000, 70500, 71000, 71500, 72000, 72500, 73000, 73500, 74000, 74500, 75000, 75500, 76000, 76500, 77000, 77500, 78000, 78500, 79000, 79500, 80000, 80500, 81000, 81500, 82000, 82500, 83000, 83500, 84000, 84500, 85000, 85500, 86000, 86500, 87000, 87500, 88000, 88500, 89000, 89500, 90000, 90500, 91000, 91500, 92000, 92500, 93000, 93500, 94000, 94500, 95000, 95500, 96000, 96500, 97000, 97500, 98000, 98500, 99000, 99500, 100000, 100500, 101000, 101500, 102000, 102500, 103000, 103500, 104000, 104500, 105000, 105500, 106000, 106500, 107000, 107500, 108000, 108500, 109000, 109500, 110000, 110500, 111000, 111500, 112000, 112500, 113000, 113500, 114000, 114500, 115000, 115500, 116000, 116500, 117000, 117500, 118000, 118500, 119000, 119500, 120000, 120500, 121000, 121500, 122000, 122500, 123000, 123500, 124000, 124500, 
125000, 125500, 126000, 126500, 127000, 127500, 128000, 128500, 129000, 129500, 130000, 130500, 131000, 131500, 132000, 132500, 133000, 133500, 134000, 134500, 135000, 135500, 136000, 136500, 137000, 137500, 138000, 138500, 139000, 139500, 140000, 140500, 141000, 141500, 142000, 142500, 143000, 143500, 144000, 144500, 145000, 145500, 146000, 146500, 147000, 147500, 148000, 148500, 149000, 149500], "global_num_gpus": 4}']' returned non-zero exit status 1. [email protected]`

    bug 
    opened by ahmedavid 0
  • Upstream DeepSpeed -> HF checkpoint conversion script update

    This PR fixes #750 . Upstream DeepSpeed saves checkpoints in a new layout which is incompatible with the old conversion script. This makes convert_to_hf.py work with upstream DeepSpeed, and leaves legacy_convert_to_hf.py for conversion from DeeperSpeed.

    Draft for now because I need to test on a real model.

    opened by haileyschoelkopf 0
  • Upstream DeepSpeed breaks HF conversion script

    The tools/convert_to_hf.py script will need to be updated, or a different version may need to be created, for checkpoints saved with upstream DeepSpeed. Checkpoints are no longer saved layer-by-layer, it seems, and all weights are now in mp_rank_{MP_RANK}_model_states.pt files, one per model-parallel partition.

    Upstream DeepSpeed checkpoint:

    drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
    -rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt
    

    DeeperSpeed checkpoint:

    drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
    ...
    

    Updating the script shouldn't be too hard at all though.
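
    A minimal sketch of inspecting one of the new-style shards (the path is hypothetical; 'module' is assumed to be the key holding the model weights, matching the key the DeepSpeed engine loads in the related traceback):

    import torch

    # load a merged upstream-DeepSpeed shard on CPU and look at what it contains
    ckpt = torch.load("global_step1000/mp_rank_00_model_states.pt", map_location="cpu")
    print(list(ckpt.keys()))          # top-level keys; expected to include 'module'

    state_dict = ckpt["module"]       # module weights for this model-parallel rank
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))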

    bug 
    opened by haileyschoelkopf 0
  • Issue deploying GPT-NeoX-20b on AWS Sagemaker with Jupyter Notebook

    Describe the bug I get the following error upon trying to use predictor.predict(data) for AWS Sagemaker.

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
      "code": 400,
      "type": "InternalServerException",
      "message": "Could not load model /.sagemaker/mms/models/EleutherAI__gpt-neox-20b with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM\u0027\u003e)."
    }
    

    To Reproduce Steps to reproduce the behavior:

    1. Create Dockerfile
    2. Add the following into Dockerfile
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
    RUN pip install --upgrade 'transformers==4.25.1'
    RUN pip install --upgrade 'torch==1.13.0'
    
    3. Build the image via the command docker build -t gpt-neox . in the directory the Dockerfile is in
    4. Create a file named dockerize.sh
    5. Add the following content into the file
    %%sh
    
    # Specify an algorithm name
    algorithm_name=gpt-neox
    
    account=$(aws sts get-caller-identity --query Account --output text)
    
    # Get the region defined in the current configuration (default to us-west-2 if none defined)
    region=$(aws configure get region)
    
    fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
    
    # If the repository doesn't exist in ECR, create it.
    
    aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
    if [ $? -ne 0 ]
    then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
    fi
    
    # Log into Docker
    aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
    
    # Build the docker image locally with the image name and then push it to ECR
    # with the full name.
    
    docker build -t ${algorithm_name} .
    docker tag ${algorithm_name} ${fullname}
    
    docker push ${fullname}
    
    6. Run command docker login (you need the Docker CLI)
    7. Execute the shell script file (you need the AWS CLI)
    8. Open Jupyter Notebook
    9. Add the following to Jupyter Notebook
    %pip install sagemaker
    %pip install boto3
    
    from sagemaker.huggingface import HuggingFaceModel
    import boto3
    
    client=boto3.client('sts')
    account=client.get_caller_identity()['Account']
    
    my_session=boto3.session.Session()
    region=my_session.region_name
    
    algorithm_name="gpt-neox"
    tag="latest"
    ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, algorithm_name, tag)
    
    role = 'SageMaker'
    
    hub = {
        'HF_MODEL_ID':'EleutherAI/gpt-neox-20b',
        'HF_TASK':'text-generation'
    }
    
    huggingface_model = HuggingFaceModel(
        image_uri=ecr_image,
        env=hub,
        role=role,
    #     transformers_version="4.17", these are not needed anymore
    #     pytorch_version="1.10",
    #     py_version="py38",
    )
    
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge"
    )
    
    10. Run this final command in Jupyter Notebook once predictor is done.
    predictor.predict({
        "inputs": "The weather is"
    })
    
    11. See error

    Expected behavior I expect to see a generated query from the input using the 20 billion parameters pretrained EleutherAI model.

    Proposed solution I suspect I could fix this issue if I altogether ditched the Hugging Face SageMaker library. Also, the model hasn't been updated for the last 8 months, so I'm not sure whether that is the cause.

    Environment (please complete the following information):

    • GPUs: none
    • Configs: unsure

    Additional context I have tried the other GPT-Neo versions, like 125M and 2.7B, and those have worked perfectly. The reason I need to extend the Docker container for AWS is to avoid another error: the transformers version (4.17?) on the default Docker image is apparently not up to date enough.

    bug 
    opened by BjornTheProgrammer 1
  • Model ckpts from `DeeperSpeed` cannot be loaded using `deepspeed_main`/upstream DeepSpeed

    Describe the bug

    Loading model checkpoints trained with DeeperSpeed (git+https://github.com/EleutherAI/DeeperSpe[email protected]#egg=deepspeed) raises an error when trying to use the deepspeed_main branch with upstream DeepSpeed.

    To Reproduce Steps to reproduce the behavior:

    Train a model from the main branch using DeeperSpeed (or download a model ckpt from s-eai-neox/pythia/1.3B/global_step71500).

    Trying to load this checkpoint using the deepspeed_main branch and upstream DeepSpeed (for either training or evaluation) gives the following error:

    Traceback (most recent call last):
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 76, in <module>
        main()
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 35, in main
        model, neox_args = setup_for_inference_or_eval(use_cache=False)
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
        model, _, _ = setup_model_and_optimizer(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/training.py", line 437, in setup_model_and_optimizer
        neox_args.iteration = load_checkpoint(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/checkpointing.py", line 235, in load_checkpoint
        checkpoint_name, state_dict = model.load_checkpoint(
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2647, in load_checkpoint
        load_path, client_states = self._load_checkpoint(load_dir,
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2713, in _load_checkpoint
        self.load_module_state_dict(state_dict=checkpoint['module'],
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2507, in load_module_state_dict
        self.module.load_state_dict(state_dict, # TODO
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1620, in load_state_dict
        raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
    TypeError: Expected state_dict to be dict-like, got <class 'NoneType'>.
    

    This gives the above traceback and checkpoint loading fails.

    Expected behavior The checkpoints should be loadable by either Deepspeed version, ideally.

    Proposed solution This could be an issue with Deepspeed checkpoint formats changing over the course of 4 versions--not sure yet.

    Additional context Relevant to merging #663 since we have checkpoints we want to use trained in DeeperSpeed.

    cc @Quentin-Anthony @dashstander @StellaAthena

    bug 
    opened by haileyschoelkopf 3