Add files using large-upload tool
README.md
CHANGED
@@ -27,7 +27,7 @@ NVIDIA does not claim ownership to any outputs generated using the Models or Der
 
 ### Intended use
 
-Nemotron-4-340B-Base is a completion model intended for use in over 50+ natural and 40+ coding languages. For best performance on a given task, users are encouraged to customize the completion model using the NeMo Framework suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA), and SFT/Steer-LM/RLHF using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner).
+Nemotron-4-340B-Base is a completion model intended for use in 50+ natural and 40+ coding languages. For best performance on a given task, users are encouraged to customize the completion model using the [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html) suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/SteerLM/RLHF using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner).
 
 **Model Developer:** NVIDIA
 
@@ -59,7 +59,7 @@ Nemotron-4
 
 ### Usage
 
-1. We will spin up an inference server and then call the inference server in a python script. Let’s first define the python script ``call_server.py
+1. We will spin up an inference server and then call it from a Python script. Let’s first define that script, ``call_server.py``.
 
 ```python
 import requests
@@ -101,7 +101,7 @@ print(response)
 ```
 
 
-2. Given this python script, we will create a bash script, which spins up the inference server within the NeMo container(docker pull nvcr.io/nvidia/nemo:24.01.framework) and calls the python script ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows,
+2. Given this Python script, we create a bash script that spins up the inference server inside the NeMo container (``docker pull nvcr.io/nvidia/nemo:24.01.framework``) and then calls ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows:
 
 
 ```bash
@@ -151,13 +151,13 @@ depends_on () {
 ```
 
 
-3
+3. We can launch ``nemo_inference.sh`` with a Slurm script like the one below, which starts a 2-node job for model inference.
 
 ```bash
 #!/bin/bash
 #SBATCH -A SLURM-ACCOUNT
 #SBATCH -p SLURM-PARTITION
 #SBATCH -N 2
 #SBATCH -J generation
 #SBATCH --ntasks-per-node=8
 #SBATCH --gpus-per-node=8
@@ -167,8 +167,9 @@ RESULTS=<PATH_TO_YOUR_SCRIPTS_FOLDER>
 OUTFILE="${RESULTS}/slurm-%j-%n.out"
 ERRFILE="${RESULTS}/error-%j-%n.out"
 MODEL=<PATH_TO>/Nemotron-4-340B-Base
-
+CONTAINER="nvcr.io/nvidia/nemo:24.01.framework"
 MOUNTS="--container-mounts=<PATH_TO_YOUR_SCRIPTS_FOLDER>:/scripts,${MODEL}:/model"
+
 read -r -d '' cmd <<EOF
 bash /scripts/nemo_inference.sh /model
 EOF
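The diff elides the body of ``call_server.py``; only ``import requests`` near the top and the closing ``print(response)`` are visible. For orientation, a minimal client along these lines would talk to the locally hosted NeMo text-generation server. In the sketch below, the port, the ``/generate`` route, the PUT verb, and the payload fields are assumptions about that server's REST interface, not the file's actual contents:

```python
import json

import requests

# Sketch of a client for a NeMo-style text-generation server.
# ASSUMPTIONS: the host/port, the /generate route, the PUT verb, and the
# payload schema are illustrative; consult the repository's call_server.py.
HEADERS = {"Content-Type": "application/json"}


def text_generation(data, ip="localhost", port=5555):
    # The server takes a JSON body and returns JSON with the generations.
    resp = requests.put(f"http://{ip}:{port}/generate",
                        data=json.dumps(data), headers=HEADERS)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    data = {
        "sentences": ["Deep learning is"],  # batch of prompts
        "tokens_to_generate": 100,          # max new tokens to produce
        "temperature": 1.0,
        "top_k": 0,
        "top_p": 0.9,
        "greedy": False,
    }
    response = text_generation(data)
    print(response)
```

If the real script differs, only the URL and the ``data`` dictionary should need adjusting.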
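``nemo_inference.sh`` is likewise elided, but the hunk header ``@@ -151,13 +151,13 @@ depends_on () {`` shows that it defines a ``depends_on`` helper, presumably used to block until the inference server inside the container is reachable before the client runs. A plausible sketch of such a helper, with the probed address and poll interval as assumptions:

```bash
#!/bin/bash
# Sketch of a wait-for-dependency helper matching the depends_on () {
# signature visible in the hunk header. The probed address and the
# 5-second poll interval are assumptions, not the script's actual values.
depends_on () {
    HOST_AND_PORT=$1
    # Loop until an HTTP server answers on the given address.
    until curl --silent --output /dev/null "http://${HOST_AND_PORT}"; do
        echo "waiting for ${HOST_AND_PORT} ..."
        sleep 5
    done
}

# Hypothetical usage: start the server in the background, wait, then call.
# depends_on "0.0.0.0:5555" && python /scripts/call_server.py
```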
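Two details of the Slurm script are worth spelling out. ``read -r -d '' cmd <<EOF ... EOF`` stores the multi-line heredoc in ``$cmd`` (``-d ''`` tells ``read`` to keep consuming until a NUL that never arrives, so it reads the whole heredoc; its non-zero exit status at end-of-input is harmless). The line that finally consumes ``$cmd``, ``$CONTAINER``, and ``$MOUNTS`` falls outside the diff; on a pyxis-enabled cluster it would typically be an ``srun`` of the following shape (an assumption, not the file's actual line):

```bash
# Assumed final line: run $cmd inside the container on the allocated nodes.
# --container-image is a pyxis flag; $MOUNTS already carries the
# --container-mounts=... string defined above.
srun -o "$OUTFILE" -e "$ERRFILE" --container-image="$CONTAINER" $MOUNTS bash -c "${cmd}"
```

The job is then submitted with ``sbatch`` as usual.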