An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

Overview

GPT-NeoX

This repository is under development and may change rapidly without warning.

Requirements

Everything you need to get started running the code can be installed via pip:

$ pip install -r requirements.txt

Important: This codebase does not install Microsoft's DeepSpeed library. It installs DeeperSpeed, EleutherAI's variant of the original DeepSpeed. We have added some necessary functionality for our purposes and patched holes created by the fact that only parts of DeepSpeed were publicly released, but DeeperSpeed uses the same namespace as DeepSpeed and may break other code built upon DeepSpeed. If you use or suspect you might use Microsoft's DeepSpeed for another project, we strongly recommend you use Anaconda to install this code in an isolated environment by creating a conda environment and running conda install --file requirements.txt. We welcome any suggestions for improvements to our DeeperSpeed library, but please open issues on its repo rather than this one.

EleutherAI members who wish to run models on our Kubernetes cluster will additionally need to install Kubernetes and obtain authorization from Stella Biderman or Sid Black. Please reach out on Discord in the #gpt-neo channel. You will also need to create a WandB account and share your username so that you can be added to the organization WandB account.

Running the code

The core anatomy of a call to the DeepSpeed engine is the following:

$ deepspeed --hostfile=host_path train_script.py user_args \
	--deepspeed \
	--deepspeed_config deepspeed_config.json

where

  • host_path (optional) is the path to the host file containing the addresses of the machines you wish to train on.
  • train_script.py is the training script you wish to use. Our main training script is train_pipeline.py.
  • deepspeed_config.json is the json file containing DeepSpeed-specific hyperparameters.
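To make the argument ordering concrete, here is a minimal Python sketch that assembles the argument list for the command above. The helper name build_deepspeed_command and the example paths are hypothetical; only the flags shown above (--hostfile, --deepspeed, --deepspeed_config) come from this README.

```python
import shlex
from typing import List, Optional, Sequence


def build_deepspeed_command(
    train_script: str,
    deepspeed_config: str,
    user_args: Sequence[str] = (),
    host_path: Optional[str] = None,
) -> List[str]:
    """Assemble the launcher call in the order shown above (hypothetical helper)."""
    cmd = ["deepspeed"]
    if host_path is not None:
        # The hostfile is optional; omit the flag for single-machine runs.
        cmd.append(f"--hostfile={host_path}")
    cmd.append(train_script)
    cmd.extend(user_args)
    cmd.extend(["--deepspeed", "--deepspeed_config", deepspeed_config])
    return cmd


command = build_deepspeed_command(
    "train_pipeline.py", "deepspeed_config.json", host_path="jobs/hostfile"
)
print(shlex.join(command))
```

Building the call as a list rather than a single string keeps each argument intact if it is later passed to subprocess.run.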

In this repository, we provide a lightweight wrapper for the above function call for two main reasons. Firstly, we find the way the arguments are ordered and used somewhat counterintuitive, and secondly, our wrapper automatically uploads logging data to WandB. Everything in this repository will work with both the native DeepSpeed command and with our deepy command. The core anatomy of a deepy call is:

$ ./deepy --hostfile=host_path train_script.py deepspeed_config.json

Running the code locally

This code is set up to run automatically on as many GPUs as are available. If you have multiple GPUs and only wish to make use of some of them, you can find information about how to specify which GPU(s) to use in training here.

The most common pitfall for local training is pipeline parallelism. Pipeline parallelism partitions the model into segments (called PipelineModules in our code) that can decrease latency by running partially asynchronously.
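As a rough illustration of what partitioning means here, the following sketch splits a list of layer indices into contiguous pipeline stages of near-equal size. It is a simplified stand-in, not the partitioning logic that DeepSpeed or this repository actually uses, and partition_layers is a hypothetical name.

```python
from typing import List


def partition_layers(num_layers: int, num_stages: int) -> List[List[int]]:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if num_stages < 1 or num_stages > num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages


for stage_id, layers in enumerate(partition_layers(10, 4)):
    print(f"stage {stage_id}: layers {layers}")
```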

Running the code on a server

This code is set up to run automatically on as many GPUs as are available. To run across multiple machines, you need to make use of a hostfile which lists the IP address of each machine you wish to run the code on followed by the number of GPUs to use. For example, 123.45.67.890 slots=8 instructs the code to run on all eight GPUs of the machine at 123.45.67.890. Each machine should be listed on a separate line with no end-of-line punctuation. It is officially recommended that you set up passwordless ssh, but we have had success entering the password at run-time. To have your hostfile used by GPT-NeoX automatically, store it at ~/jobs/hostfile. Otherwise, you can provide it as an argument as shown above.

EleutherAI members: Once you have been granted access to the EleutherAI servers and have confirmed that an unused cluster is currently running, simply ssh into the cluster. If you have been granted the ability to create and destroy Kubernetes clusters, run kubernetes/deploy_k8s.sh branch_name num_pods cluster_name to create a cluster.
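The hostfile format described above is simple enough to check before launching a run. The sketch below parses lines of the form "address slots=n" and totals the GPUs; parse_hostfile is a hypothetical helper written for illustration (it is not DeepSpeed's own parser), and the addresses are placeholders.

```python
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse lines of the form '<address> slots=<n>' into {address: n}."""
    hosts: Dict[str, int] = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"line {line_number}: expected '<address> slots=<n>'")
        hosts[parts[0]] = int(parts[1][len("slots="):])
    return hosts


example = """
10.0.0.1 slots=8
10.0.0.2 slots=8
"""
hosts = parse_hostfile(example)
print(hosts, "total GPUs:", sum(hosts.values()))
```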

~/scripts/

The directory ~/scripts/ stores various scripts for automatically starting runs with particular settings and configs that we have found useful. They can be run using sh scripts/script_name.sh but should not be relied upon. We do not guarantee forward compatibility of any scripts.

Datasets

Tokenizers

Using our data

Using your data

Advanced Options

Contribute

If you want to get involved, check out our repo projects. Anything that is listed as "todo" or has not been assigned to anyone is fair game, but please leave a comment so that we know you're working on it!

Resources

If you have trouble getting the model to run, consider consulting this guide to installing in a GCE virtual machine. You may also find the (very sparse) DeepSpeed docs helpful.

Comments
  • Running on a single GPU

    I tried merging the checkpoints as described for a single GPU: python tools/merge20b.py --input_dir ./20B_checkpoints --output_dir ./20B_checkpoints_merged

    However, I'm getting this error when generating: RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50432, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

    How can I adjust the current model to match size 50432? Or is it the other way around?

    bug 
    opened by huey2531 22
  • Clean up Neox configuration

    Clean up neox configuration so config files can be used instead of a mishmash of files, command line args and environment variables.

    Aim:

    • All parameters can be set using passed json files
    • No parameters are repeated
    • Modify megatron's codebase as little as possible to make it easier to merge upstream megatron changes in the future.

    Nice to haves:

    Todo:

    • [x] Convert all examples and configs to new configuration
    • [ ] Create config documentation with all possible parameters
    • [x] Separate configs into: model and system
    • [x] Cast numbers to numbers in JSON (suggested by @StellaAthena)
    • [x] Calculate batch size from other params (micro_batch_per_gpu*GAS*n_gpus)
    opened by joshlk 19
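The batch size formula in the last item can be written out as a short sketch. The function name and the example values are illustrative assumptions, not taken from a specific config. Note that in DeepSpeed the multiplier is the number of data-parallel replicas, which equals the GPU count only when no model or pipeline parallelism is used.

```python
def train_batch_size(micro_batch_per_gpu: int, gas: int, n_replicas: int) -> int:
    """Global batch size = micro batch per GPU * gradient accumulation steps * replicas."""
    return micro_batch_per_gpu * gas * n_replicas


# Illustrative values only: 8 data-parallel replicas.
print(train_batch_size(micro_batch_per_gpu=4, gas=32, n_replicas=8))
```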
  • Model and config code of an HF gpt-neox model; a conversion script.

    The modeling and configuration files are largely based on HF's gpt-j model. (I found gpt-j's architecture more similar to gpt-neox than gpt-neo is, especially since it uses rotary embeddings.)

    Modifications to the original gpt-j modeling:

    • Added post-attention layernorm as ln_2.
    • Changed q_proj, k_proj, v_proj linear layers to a single qkv_proj that corresponds to gpt-neox's attention.query_key_value linear layer, and set bias=True.
    • Combined gpt-neox's and HF gpt-j's rotary embedding functions.
    • Set bias=False for lm_head.
    • Updated the computation in GPTNeoXBlock to match the two ways gpt-neox computes residuals, controlled by a new config argument gpt_j_residual.

    Modifications to the original gpt-j configuration:

    • Set the default value of activation_function to gelu.
    • Removed rotary_dim (so that its default value is None).
    • Added a gpt_j_residual argument (default value is False) to match the two ways gpt-neox computes residuals.

    A conversion script:

    • that reads config files and gpt-neox's output state dict files and outputs a pretrained HF pytorch GPTNeoX model.
    • Note that weights are not loaded for two kinds of model parameters transformer.h.*.attn.bias and transformer.h.*.attn.masked_bias because they should keep their default values.

    Things that I have checked with a 1B model, which is trained basically following the default XL.yml config:

    • The above conversion script works correctly.
    • The greedy-decoding outputs by gpt-neox's inference script and HF's generate() are identical.
    • Intermediate outputs (e.g. hidden states) are almost identical when running from gpt-neox code and HF code. There are some small differences, which I think are caused by precision settings.

    Things that are not included in this pull request:

    • Tensorflow-related code.
    • Conversion script that considers checkpoints trained with model parallel.
    • Other model variants that might use a different type of, for example, rotary embeddings.
    • Things that haven't come to my mind.

    BTW, I haven't found a better place to put the new files so I simply created a directory huggingface under /tools.

    A table that summarizes parameters (and their shapes) in 1) a GPT-NeoX checkpoint, 2) an HF GPTNeo model, 3) an HF GPTJ model, and 4) the HF GPTNeoX model in this pull request: [image: Screen Shot 2021-12-15 at 13 29 32]

    opened by ZHAOTING 16
  • align gpt-j layernorm to hf

    Looking deeper into the gpt-j residual implementation I found a delta in the way layernorm(s) are applied. I don't see the point in applying two separate layer norm modules to the hidden_states (x)

    Compare the HF implementation. https://github.com/huggingface/transformers/blob/a94105f95fb66ee4129077c03e4e8a224f6a07fd/src/transformers/models/gptj/modeling_gptj.py#L279

    Is there a reason for having two layernorms? Am I completely off?

    opened by sweinbach 15
  • 13B Model Out of Memory with Single Node 8 A100 GPUs

    Hi!

    Thanks for making this repo available :)

    I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but was unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B. What is required for this? I also tried DeepSpeed stage 3 with offload, without using pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!

    opened by benathi 14
  • Add support for Flash attention

    This PR adds Tri Dao's Flash Attention as an optional backend for the global attention operation, enabled by setting the attention_config to [[["flash"], ...]. I've tested the changes in my own environment and consistently see a 2x boost for 4K sequence lengths in models ranging from 100M - 3B parameters.

    Maybe relevant: @tridao @lucidrains

    opened by VHellendoorn 13
  • Pipeline parallelism and gradient checkpointing (edit: and ZeRO 2!) don’t work together

    Pipeline parallelism and gradient checkpointing both work when you use them individually. However, when you turn them both on, you get a mysterious KeyError: 0 from somewhere deep in DeepSpeed.

    bug 
    opened by StellaAthena 12
  • distributed training with multiple nodes.

    Hi, I want to train a 13B model on 4 nodes, each with 8 A100 GPUs, but I don't know how to run the code on my cluster. Can you show me an example? So far I have only run it successfully on a single node.

    bug 
    opened by cdj0311 11
  • fix alibi inference shapes for cached layer_past

    Restart the now reverted previous fix.

    History:

    • Old PR was merged, after which some (small?) differences in model output became apparent in discussion with @sdtblck
    • Old PR was reverted
    • This PR is opened to discuss the issue

    Tests and validations so far:

    Inference was tested on a trained neox checkpoint (model size ~4B, 60k steps trained). Random sampling is deactivated (no top_k, top_p, temperature).

    1. test NOT using recompute, i.e. with cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation
    

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the model is working
    => generated text seems far from random (though not a fairytale :-) )

    2. test using recompute, i.e. with NO cache values used in the text generation (interactive)
    python deepy.py generate.py -d configs ... text_generation

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => The generated text is exactly the same as above

    Tests and validations that resulted in discussions.

    The following line has been added to text_generation_utils.py to print logits. [image]

    1. test NOT using recompute, logit output

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.8750, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0312, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(34.9688, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5938, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.4062, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.5312, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.3125, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.7188, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1875, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.5000, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.2812, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.4062, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.4688, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.3750, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46., device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.9062, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6875, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.8750, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(30480, device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.8125, device='cuda:0', dtype=torch.float16)
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => the generated text is the same as above

    2. test using recompute, logit output

    (copy pasted from terminal)
    Context prompt >>> Once upon a time
    generated_tokens tensor(15, device='cuda:0') generated_token_logits tensor(42.1562, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(37.4375, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1711, device='cuda:0') generated_token_logits tensor(33.6250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2333, device='cuda:0') generated_token_logits tensor(36.9062, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(42.1875, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1635, device='cuda:0') generated_token_logits tensor(40.0625, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(39.0938, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(6834, device='cuda:0') generated_token_logits tensor(36.5312, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(554, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(1287, device='cuda:0') generated_token_logits tensor(35.0312, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(36.5625, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(40.1250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2329, device='cuda:0') generated_token_logits tensor(34.2500, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(45857, device='cuda:0') generated_token_logits tensor(33.3125, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(348, device='cuda:0') generated_token_logits tensor(40.6562, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(40126, device='cuda:0') generated_token_logits tensor(38.2812, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(280, device='cuda:0') generated_token_logits tensor(41.6875, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(301, device='cuda:0') generated_token_logits tensor(41.1250, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(5510, device='cuda:0') generated_token_logits tensor(36.4688, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(6405, device='cuda:0') generated_token_logits tensor(41.3750, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(338, device='cuda:0') generated_token_logits tensor(44.3438, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(7384, device='cuda:0') generated_token_logits tensor(43.2500, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(672, device='cuda:0') generated_token_logits tensor(45.2188, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(327, device='cuda:0') generated_token_logits tensor(46.0625, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(779, device='cuda:0') generated_token_logits tensor(39.8750, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(247, device='cuda:0') generated_token_logits tensor(42.6562, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(2556, device='cuda:0') generated_token_logits tensor(36.8438, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(3991, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(30480, device='cuda:0') generated_token_logits tensor(39.9688, device='cuda:0', dtype=torch.float16)
    generated_tokens tensor(17, device='cuda:0') generated_token_logits tensor(42.9375, device='cuda:0', dtype=torch.float16)
    Generated Text: , the only way to get a job at any of the many colleges and universities in the United States of America was to have a high school diploma

    => Generated text still the same
    => Logits are different. This is the question at hand

    Questions

    • Why are logits different?
    • Does the difference only occur for alibi? Is this a general issue if an issue at all?
    opened by sweinbach 11
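One way to put the logit differences in perspective, without claiming to identify their cause, is to express them in units of float16 spacing. The sketch below is an illustration only: fp16_spacing is a hypothetical helper, and the pairs are a handful of (cached, recomputed) values copied from the two outputs above.

```python
import math


def fp16_spacing(x: float) -> float:
    """Gap between adjacent float16 values near x (normal range only)."""
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 11)  # float16 stores 10 explicit mantissa bits


# (cached, recomputed) logit pairs copied from the outputs above.
pairs = [(42.1250, 42.1562), (36.8750, 36.9062), (34.9688, 35.0312), (43.4688, 43.2500)]

for cached, recomputed in pairs:
    steps = abs(cached - recomputed) / fp16_spacing(cached)
    print(f"{cached:8.4f} vs {recomputed:8.4f}: about {steps:.0f} float16 step(s)")
```

For values between 32 and 64, adjacent float16 numbers are 0.03125 apart, so several of the differences shown are a single representable step.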
  • Running through Dockerfile broken

    Describe the bug: When using an image based on the provided Dockerfile and running the quick start steps (download enron data, run deepy.py), execution crashes before training begins.

    To Reproduce:

    1. Build an image using the provided Dockerfile
    2. Run said image, mounting 8 RTX8000 GPUs
    3. Fetch enron data using the prepare_dataset.py script
    4. Run ./deepy.py pretrain_gpt2.py -d configs small.yml local_configs.yml
    5. The code crashes with a nondescript NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

    Expected behavior: Training starts or a specific error is provided.

    Proposed solution: The NCCL error is typically a stand-in for a real issue that is not relayed back through multiprocessing. As a first step, it would be nice to know if this setup works out-of-the-box for others; in that case, it might be my resources or CUDA version.

    Environment (please complete the following information):

    • GPUs: 8 RTX8000 GPUs
    • Configs: Ubuntu 20.04, Cuda 11.2

    bug 
    opened by VHellendoorn 11
  • Create experiment runners

    We will want to run experiments with a variety of configs and options. To enable this, we need two things:

    • [ ] config files that we can use to specify the settings for a particular run
    • [ ] an experiment runner for managing and automatically executing several runs
    feature request good first issue 
    opened by StellaAthena 11
  • In interactive mode, a prompt of more than one word causes a crash

    Describe the bug: In interactive mode, a prompt longer than one word causes a crash. When I type just one word, it generates text, though.

    text_generation.yml

    {
      "text-gen-type": "interactive",
      "maximum_tokens": 500,
      "temperature": 0.9,
      "top_p": 0,
      "top_k": 0,
      "recompute": false,
      "num-samples": 10,
      "sample-input-file": "prompt.txt",
      "sample-output-file": "sample_output.txt",
    }

    `Context prompt >>> Hello from Traceback (most recent call last): Traceback (most recent call last): File "generate.py", line 89, in File "generate.py", line 89, in main() File "generate.py", line 72, in main main() File "generate.py", line 72, in main generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive generate_samples_interactive( File "/gpt-neox/megatron/text_generation_utils.py", line 760, in generate_samples_interactive for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens for ( File "/gpt-neox/megatron/text_generation_utils.py", line 316, in stream_tokens logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model logits = forward_model(model, model_inputs, neox_args.is_pipe_parallel) File "/gpt-neox/megatron/text_generation_utils.py", line 156, in forward_model loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch loss, logits = model.eval_batch(model_inputs, return_logits=True) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 394, in eval_batch self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_schedule(sched) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 1308, in _exec_schedule self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self._exec_instr(**cmd.kwargs) File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 700, in _exec_forward_pass self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy losses 
= mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy self.loss = self.loss_model(outputs, labels) File "/gpt-neox/megatron/model/gpt2_model.py", line 67, in cross_entropy return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward losses = mpu.vocab_parallel_cross_entropy(output.float().contiguous(), labels) predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] File "/gpt-neox/megatron/mpu/cross_entropy.py", line 117, in vocab_parallel_cross_entropy

    IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target) File "/gpt-neox/megatron/mpu/cross_entropy.py", line 63, in forward predicted_logits_1d = logits_2d[arange_1d, masked_target_1d] IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [2], [3] Killing subprocess 7479 Killing subprocess 7480 Killing subprocess 7481 Killing subprocess 7482 Traceback (most recent call last): File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 179, in main() File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 169, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/usr/local/lib/python3.8/dist-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python', '-u', 'generate.py', '--local_rank=3', '--deepspeed_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, "optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2}', '--megatron_config', '{"train_batch_size": 128, "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 32, 
"optimizer": {"type": "Adam", "params": {"lr": 9.7e-05, "betas": [0.9, 0.95], "eps": 1e-08}}, "fp16": {"fp16": true, "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 12, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1260000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1260000000, "contiguous_gradients": true}, "steps_per_print": 2, "precision": "fp16", "num_layers": 44, "hidden_size": 6144, "num_attention_heads": 64, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "scaled_upper_triang_masked_softmax_fusion": true, "bias_gelu_fusion": true, "rotary_pct": 0.25, "init_method": "small_init", "output_layer_init_method": "wang_init", "gpt_j_residual": true, "gpt_j_tied": true, "output_layer_parallelism": "column", "lr_decay_style": "cosine", "lr_decay_iters": 150000, "min_lr": 9.7e-06, "optimizer_type": "Adam", "zero_stage": 1, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 1260000000, "zero_allgather_bucket_size": 1260000000, "lr": 9.7e-05, "tokenizer_type": "HFTokenizer", "data_path": "./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document", "data_impl": "mmap", "save": "./20B_checkpoints", "config_files": {"20B.yml": "# DISCLAIMER: This is the configuration file for the GPT-NeoX-20B model as it was trained on 96x 40GB A100\n# 
GPUs. Depending on your system configuration, you may need to change some parameters in order to fit\n# the model in memory.\n\n{\n # Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in\n \"vocab-file\": \"./20B_checkpoints/20B_tokenizer.json\",\n \"save\": \"./20B_checkpoints\",\n \"load\": \"./20B_checkpoints\",\n\n # If finetuning, edit the following to the location of your finetuning dataset:\n \"data-path\": \"./data/pile_20B_tokenizer/pile_20B_tokenizer_text_document\",\n\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 2,\n \"model-parallel-size\": 2,\n\n # model settings\n \"num-layers\": 44,\n \"hidden-size\": 6144,\n \"num-attention-heads\": 64,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"rotary_pct\": 0.25,\n \"no-weight-tying\": true,\n \"gpt_j_residual\": true,\n \"gpt_j_tied\": true,\n \"output_layer_parallelism\": \"column\",\n \"scaled-upper-triang-masked-softmax-fusion\": true,\n \"bias-gelu-fusion\": true,\n\n # init methods\n \"init_method\": \"small_init\",\n \"output_layer_init_method\": \"wang_init\",\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.97e-4,\n \"betas\": [0.9, 0.95],\n \"eps\": 1.0e-8,\n }\n },\n\n \"min_lr\": 0.97e-5,\n\n # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training\n \"zero_optimization\": {\n \"stage\": 1,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 1260000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 1260000000,\n \"contiguous_gradients\": True,\n },\n\n # batch / data settings (assuming 96 GPUs)\n \"train_micro_batch_size_per_gpu\": 4,\n \"gradient_accumulation_steps\": 32,\n \"data-impl\": 
\"mmap\",\n \"split\": \"995,4,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": false,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 0.01,\n \"hidden-dropout\": 0,\n \"attention-dropout\": 0,\n\n # precision settings\n \"fp16\": {\n \"fp16\": true,\n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"initial_scale_power\": 12,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 150000,\n \"lr-decay-iters\": 150000,\n\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"checkpoint-factor\": 500,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 2,\n \"steps_per_print\": 2,\n \"wall_clock_breakdown\": false,\n\n ### NEW DATA: ####\n \"tokenizer_type\": \"HFTokenizer\",\n \"tensorboard-dir\": \"./tensorboard\",\n \"log-dir\": \"./logs\",\n\n}\n", "text_generation_interactive.yml": "# Parameters used for text generation\n# Make sure load is specified somewhere else\n{\n # Text gen type: input-file, unconditional or interactive\n \"text-gen-type\": \"interactive\",\n\n # Params for all\n \"maximum_tokens\": 500,\n \"temperature\": 0.9,\n \"top_p\": 0,\n \"top_k\": 0,\n \"recompute\": false,\n\n # unconditional: samples\n \"num-samples\": 10,\n\n # input/output file\n \"sample-input-file\": \"prompt.txt\",\n \"sample-output-file\": \"sample_output.txt\",\n}\n"}, "load": "./20B_checkpoints", "checkpoint_factor": 500, "batch_size": 4, "train_iters": 150000, "eval_iters": 10, "split": "995,4,1", "vocab_file": "./20B_checkpoints/20B_tokenizer.json", "attention_dropout": 0, "hidden_dropout": 0, "checkpoint_activations": true, "synchronize_each_layer": true, "gas": 32, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 2, "model_parallel_size": 2, "is_pipe_parallel": true, 
"wandb_group": "72fp5jTbC3iYzFUHnE9Fh2_35wkyadj", "log_dir": "./logs", "tensorboard_dir": "./tensorboard", "log_interval": 2, "text_gen_type": "interactive", "temperature": 0.9, "maximum_tokens": 500, "sample_input_file": "prompt.txt", "sample_output_file": "sample_output.txt", "num_samples": 10, "user_script": "generate.py", "save_iters": [500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 9000, 9500, 10000, 10500, 11000, 11500, 12000, 12500, 13000, 13500, 14000, 14500, 15000, 15500, 16000, 16500, 17000, 17500, 18000, 18500, 19000, 19500, 20000, 20500, 21000, 21500, 22000, 22500, 23000, 23500, 24000, 24500, 25000, 25500, 26000, 26500, 27000, 27500, 28000, 28500, 29000, 29500, 30000, 30500, 31000, 31500, 32000, 32500, 33000, 33500, 34000, 34500, 35000, 35500, 36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500, 43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500, 50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500, 57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500, 64000, 64500, 65000, 65500, 66000, 66500, 67000, 67500, 68000, 68500, 69000, 69500, 70000, 70500, 71000, 71500, 72000, 72500, 73000, 73500, 74000, 74500, 75000, 75500, 76000, 76500, 77000, 77500, 78000, 78500, 79000, 79500, 80000, 80500, 81000, 81500, 82000, 82500, 83000, 83500, 84000, 84500, 85000, 85500, 86000, 86500, 87000, 87500, 88000, 88500, 89000, 89500, 90000, 90500, 91000, 91500, 92000, 92500, 93000, 93500, 94000, 94500, 95000, 95500, 96000, 96500, 97000, 97500, 98000, 98500, 99000, 99500, 100000, 100500, 101000, 101500, 102000, 102500, 103000, 103500, 104000, 104500, 105000, 105500, 106000, 106500, 107000, 107500, 108000, 108500, 109000, 109500, 110000, 110500, 111000, 111500, 112000, 112500, 113000, 113500, 114000, 114500, 115000, 115500, 116000, 116500, 117000, 
117500, 118000, 118500, 119000, 119500, 120000, 120500, 121000, 121500, 122000, 122500, 123000, 123500, 124000, 124500, 125000, 125500, 126000, 126500, 127000, 127500, 128000, 128500, 129000, 129500, 130000, 130500, 131000, 131500, 132000, 132500, 133000, 133500, 134000, 134500, 135000, 135500, 136000, 136500, 137000, 137500, 138000, 138500, 139000, 139500, 140000, 140500, 141000, 141500, 142000, 142500, 143000, 143500, 144000, 144500, 145000, 145500, 146000, 146500, 147000, 147500, 148000, 148500, 149000, 149500], "global_num_gpus": 4}']' returned non-zero exit status 1. [email protected]`
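The 20B.yml embedded in the log above sets a micro batch size of 4 per GPU, 32 gradient accumulation steps, and pipeline and model parallel sizes of 2 each, with comments that assume 96 GPUs, while the failing run reports "global_num_gpus": 4. A minimal sketch of how these values combine into a global batch size, assuming the usual relationship (data-parallel size = GPUs / (pipeline size × model size), global batch = micro batch × accumulation steps × data-parallel size). The function names are hypothetical and not part of GPT-NeoX:

```python
def data_parallel_size(num_gpus: int, pipe_parallel_size: int, model_parallel_size: int) -> int:
    """Number of data-parallel replicas left after pipeline and model parallelism."""
    gpus_per_replica = pipe_parallel_size * model_parallel_size
    if num_gpus % gpus_per_replica != 0:
        raise ValueError("num_gpus must be divisible by pipe_parallel_size * model_parallel_size")
    return num_gpus // gpus_per_replica


def global_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, dp_size: int) -> int:
    """Sequences consumed per optimizer step across all data-parallel replicas."""
    return micro_batch_per_gpu * grad_accum_steps * dp_size


# Values from the config: 96 GPUs as assumed by its comments, 4 GPUs as in the failing run.
dp_96 = data_parallel_size(96, pipe_parallel_size=2, model_parallel_size=2)
dp_4 = data_parallel_size(4, pipe_parallel_size=2, model_parallel_size=2)
batch_96 = global_batch_size(4, 32, dp_96)
batch_4 = global_batch_size(4, 32, dp_4)
```

Under these assumptions the same config yields a much smaller global batch on 4 GPUs than on 96, which is worth keeping in mind when reusing the learning-rate settings.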

    bug 
    opened by ahmedavid 0
  • Upstream DeepSpeed -> HF checkpoint conversion script update

    This PR fixes #750. Upstream DeepSpeed saves checkpoints in a new layout that is incompatible with the old conversion script. This PR makes convert_to_hf.py work with upstream DeepSpeed and keeps legacy_convert_to_hf.py for converting DeeperSpeed checkpoints.

    Draft for now because I need to test on a real model.

    opened by haileyschoelkopf 0
  • Upstream DeepSpeed breaks HF conversion script

    The tools/convert_to_hf.py script will need to be updated, or a separate version created, for checkpoints saved with upstream DeepSpeed. It seems that checkpoints are no longer saved layer by layer; instead, all weights are stored in mp_rank_{MP_RANK}_model_states.pt files, one for each model parallel partition.

    Upstream DeepSpeed checkpoint:

    drwxr-xr-x 2 hailey eleuther     33280 Dec 18 14:29 configs
    -rw-r--r-- 1 hailey eleuther 810771646 Dec 18 14:29 mp_rank_00_model_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_1_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_2_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_3_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_4_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008143 Dec 18 14:29 zero_pp_rank_5_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608008079 Dec 18 14:29 zero_pp_rank_6_mp_rank_00_optim_states.pt
    -rw-r--r-- 1 hailey eleuther 608006863 Dec 18 14:29 zero_pp_rank_7_mp_rank_00_optim_states.pt
    

    DeeperSpeed checkpoint:

    drwxrwxrwx 2 hailey eleuther     33280 Nov 18 04:55 configs
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_00-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_02-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_03-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_04-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_05-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_06-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_07-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_08-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_09-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_10-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_11-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_12-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_13-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_14-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_15-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_16-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_17-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_18-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_19-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_20-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_21-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_22-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_23-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_24-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 100720126 Nov 18 04:55 layer_25-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther      9127 Nov 18 04:55 layer_27-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther 206045931 Nov 18 04:55 layer_28-model_00-model_states.pt
    -rwxrwxrwx 1 hailey eleuther     16291 Nov 18 04:55 mp_rank_00_model_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_0_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_10_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_11_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_12_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_13_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_14_mp_rank_00_optim_states.pt
    -rwxrwxrwx 1 hailey eleuther 287605953 Nov 18 04:55 zero_pp_rank_15_mp_rank_00_optim_states.pt
    ...
    

    Updating the script shouldn't be too hard at all though.
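As a starting point, the two listings above differ in a way that can be detected from file names alone: the DeeperSpeed checkpoint contains layer_XX-model_YY-model_states.pt files, while the upstream one contains only mp_rank_XX_model_states.pt files for model weights. A minimal sketch of such a check follows. The function name and the "layerwise" / "consolidated" labels are hypothetical, and the patterns are inferred only from the listings shown here:

```python
import re
from typing import Iterable

# File-name patterns inferred from the two directory listings in this issue.
LAYER_FILE = re.compile(r"^layer_\d+-model_\d+-model_states\.pt$")
MP_RANK_FILE = re.compile(r"^mp_rank_\d+_model_states\.pt$")


def detect_checkpoint_layout(filenames: Iterable[str]) -> str:
    """Guess the checkpoint layout from the file names in a checkpoint directory."""
    names = list(filenames)
    # Check layer files first: the DeeperSpeed listing also contains a small
    # mp_rank_00_model_states.pt, so that file alone does not identify the layout.
    if any(LAYER_FILE.match(name) for name in names):
        return "layerwise"
    if any(MP_RANK_FILE.match(name) for name in names):
        return "consolidated"
    return "unknown"
```

A conversion script could call this on `os.listdir(checkpoint_dir)` and dispatch to the matching loader, rather than failing partway through on an unexpected layout.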

    bug 
    opened by haileyschoelkopf 0
  • Issue deploying GPT-NeoX-20b on AWS Sagemaker with Jupyter Notebook

    Describe the bug

    I get the following error when calling predictor.predict(data) on AWS SageMaker.

    ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
      "code": 400,
      "type": "InternalServerException",
      "message": "Could not load model /.sagemaker/mms/models/EleutherAI__gpt-neox-20b with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM\u0027\u003e)."
    }
    

    To Reproduce

    Steps to reproduce the behavior:

    1. Create Dockerfile
    2. Add the following into Dockerfile
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
    RUN pip install --upgrade 'transformers==4.25.1'
    RUN pip install --upgrade 'torch==1.13.0'
    
    3. Build the image with the command docker build -t gpt-neox . in the directory that contains the Dockerfile
    4. Create a file named dockerize.sh
    5. Add the following content to the file
    %%sh
    
    # Specify an algorithm name
    algorithm_name=gpt-neox
    
    account=$(aws sts get-caller-identity --query Account --output text)
    
    # Get the region defined in the current configuration (default to us-west-2 if none defined)
    region=$(aws configure get region)
    region=${region:-us-west-2}
    
    fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
    
    # If the repository doesn't exist in ECR, create it.
    
    aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
    if [ $? -ne 0 ]
    then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
    fi
    
    # Log into Docker
    aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
    
    # Build the docker image locally with the image name and then push it to ECR
    # with the full name.
    
    docker build -t ${algorithm_name} .
    docker tag ${algorithm_name} ${fullname}
    
    docker push ${fullname}
    
    6. Run the command docker login (you need the Docker CLI)
    7. Execute the shell script file (you need the AWS CLI)
    8. Open Jupyter Notebook
    9. Add the following to Jupyter Notebook
    %pip install sagemaker
    %pip install boto3
    
    from sagemaker.huggingface import HuggingFaceModel
    import boto3
    
    client=boto3.client('sts')
    account=client.get_caller_identity()['Account']
    
    my_session=boto3.session.Session()
    region=my_session.region_name
    
    algorithm_name="gpt-neox"
    tag="latest"
    ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, algorithm_name, tag)
    
    role = 'SageMaker'
    
    hub = {
        'HF_MODEL_ID':'EleutherAI/gpt-neox-20b',
        'HF_TASK':'text-generation'
    }
    
    huggingface_model = HuggingFaceModel(
        image_uri=ecr_image,
        env=hub,
        role=role,
    #     transformers_version="4.17", these are not needed anymore
    #     pytorch_version="1.10",
    #     py_version="py38",
    )
    
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge"
    )
    
    10. Once the predictor has finished deploying, run this final command in Jupyter Notebook:
    predictor.predict({
        "inputs": "The weather is"
    })
    
    11. See error

    Expected behavior

    I expect to see text generated from the input by the pretrained 20-billion-parameter EleutherAI model.

    Proposed solution

    I suspect I could fix this issue if I ditched the Hugging Face SageMaker library altogether. Also, the model hasn't been updated in the last 8 months, so I'm not sure whether that is the cause.

    Environment (please complete the following information):

    • GPUs: none
    • Configs: unsure

    Additional context

    I have tried other GPT-Neo versions, such as 125M and 2.7B, and those worked perfectly. I need to extend the Docker container for AWS to avoid another error, which apparently occurs because the version of transformers (4.17?) in the default Docker image is not recent enough.
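Since the smaller models load and the 20B model does not, one check worth making is whether the weights fit in the instance's memory at all. The sketch below is a rough estimate only and does not establish the cause of the error above. It assumes the common approximation of 12 × layers × hidden_size² parameters for the transformer blocks (embeddings, biases, and layer norms ignored), uses the 44 layers and hidden size 6144 from the 20B configuration quoted elsewhere on this page, and assumes 16 GiB of memory for ml.m5.xlarge, which should be checked against current AWS documentation. All names are hypothetical:

```python
def estimate_transformer_params(num_layers: int, hidden_size: int) -> int:
    """Approximate parameter count of the transformer blocks (embeddings ignored)."""
    return 12 * num_layers * hidden_size ** 2


def weights_size_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights; 2 bytes per parameter for fp16."""
    return num_params * bytes_per_param / 2 ** 30


# 20B configuration values: 44 layers, hidden size 6144.
params = estimate_transformer_params(num_layers=44, hidden_size=6144)
fp16_gib = weights_size_gib(params)

# Assumption: memory of the instance type used in the deploy call above.
ASSUMED_INSTANCE_MEMORY_GIB = 16
fits = fp16_gib <= ASSUMED_INSTANCE_MEMORY_GIB
```

Under these assumptions the fp16 weights alone come to more than twice the assumed instance memory, so trying a larger instance type is a cheap experiment before abandoning the SageMaker library.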

    bug 
    opened by BjornTheProgrammer 1
  • Model ckpts from `DeeperSpeed` cannot be loaded using `deepspeed_main`/upstream DeepSpeed

    Describe the bug

    Loading model checkpoints trained with DeeperSpeed (git+https://github.com/EleutherAI/DeeperSpe[email protected]#egg=deepspeed ) raises an error on the deepspeed_main branch with upstream DeepSpeed.

    To Reproduce

    Steps to reproduce the behavior:

    1. Train a model from the main branch using DeeperSpeed (or download the model checkpoint from s-eai-neox/pythia/1.3B/global_step71500).

    2. Try to load this checkpoint using the deepspeed_main branch and upstream DeepSpeed (for either training or evaluation):

    Traceback (most recent call last):
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 76, in <module>
        main()
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/evaluate.py", line 35, in main
        model, neox_args = setup_for_inference_or_eval(use_cache=False)
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
        model, _, _ = setup_model_and_optimizer(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/training.py", line 437, in setup_model_and_optimizer
        neox_args.iteration = load_checkpoint(
      File "/fsx/hailey/deepspeed-main-neox/gpt-neox/megatron/checkpointing.py", line 235, in load_checkpoint
        checkpoint_name, state_dict = model.load_checkpoint(
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2647, in load_checkpoint
        load_path, client_states = self._load_checkpoint(load_dir,
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2713, in _load_checkpoint
        self.load_module_state_dict(state_dict=checkpoint['module'],
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2507, in load_module_state_dict
        self.module.load_state_dict(state_dict, # TODO
      File "/fsx/shiv/torchtest/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1620, in load_state_dict
        raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
    TypeError: Expected state_dict to be dict-like, got <class 'NoneType'>.
    

    This gives the above traceback and checkpoint loading fails.
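The traceback shows that checkpoint['module'] is None by the time it reaches load_state_dict. A minimal sketch of a helper for inspecting a loaded checkpoint dictionary before handing it to the engine follows. The function name and return labels are hypothetical, and the only key assumed is 'module', which is the one the traceback references:

```python
from typing import Any, Mapping


def describe_module_entry(checkpoint: Mapping[str, Any]) -> str:
    """Report whether a loaded checkpoint carries usable model weights under 'module'."""
    if "module" not in checkpoint:
        return "missing"  # key absent entirely
    if checkpoint["module"] is None:
        return "none"     # key present but empty, as in the traceback above
    return "present"


# With PyTorch installed, a checkpoint file could be inspected like this
# (path is a placeholder):
#   import torch
#   checkpoint = torch.load("mp_rank_00_model_states.pt", map_location="cpu")
#   print(describe_module_entry(checkpoint))
```

Running this on the mp_rank_00_model_states.pt file from each DeepSpeed variant would show whether the two formats store weights in different places, which is one plausible explanation but not one the traceback alone confirms.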

    Expected behavior

    Ideally, the checkpoints should be loadable by either DeepSpeed version.

    Proposed solution

    This could be caused by DeepSpeed checkpoint formats changing over the course of 4 versions, but I'm not sure yet.

    Additional context

    This is relevant to merging #663, since we have checkpoints trained with DeeperSpeed that we want to use.

    cc @Quentin-Anthony @dashstander @StellaAthena

    bug 
    opened by haileyschoelkopf 3
Releases(legacy_gptj_residual.1.0.0)
Owner
EleutherAI