
Comment three: after hacking my way around that, the script ran for another few hours doing disk I/O (writing the offload), such that my directory is now 1.3T (330GB .git/lfs/objects, 330GB sharded weights, and 607G of "offload").

Then it failed again:

      File "/home/dek/miniconda3/lib/python3.9/site-packages/accelerate/big_modeling.py", line 188, in dispatch_model
        main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0]
    IndexError: list index out of range
Now I'm curious just how long it will take to repro this error (i.e., running again, with the offload files already written). It's also puzzling since I set the device_map to 'auto'.

Again, all par for the course and stuff I expected (having worked in HPC/ML/science for 3 decades, you get used to research codes).



Huh. That one looks like it wasn't able to place any part of the model on the GPU. I know there is currently an issue where if the first layer of the model is too big to fit on the GPU it will put all layers on the CPU/disk (rather than trying to see if a later layer would fit). But the 3080Ti has 12GB so I'd be surprised if that's happening here?


I see the same error as well. Looking at the underlying accelerate code, it looks like if GPU is not an option then CPU and disc are completely rejected and the above error is thrown.
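
The failure mode can be sketched in isolation. This is a minimal, hypothetical reproduction of the pattern from the traceback (the function name and device maps here are made up; the real map comes out of accelerate's device-placement logic): when every entry in the device map is "cpu" or "disk", the list comprehension is empty, and indexing `[0]` raises the IndexError seen above.

```python
# Sketch of the failing pattern in the traceback (hypothetical helper,
# not accelerate's actual API).
def pick_main_device(device_map):
    # Same expression as big_modeling.py line 188: take the first device
    # that is neither "cpu" nor "disk".
    return [d for d in device_map.values() if d not in ["cpu", "disk"]][0]

# If at least one layer landed on a GPU (device index 0), this works:
print(pick_main_device({"layer_0": 0, "layer_1": "cpu"}))  # → 0

# But if device_map='auto' fell back entirely to CPU/disk offload,
# the filtered list is empty and [0] raises:
try:
    pick_main_device({"layer_0": "cpu", "layer_1": "disk"})
except IndexError as err:
    print("IndexError:", err)
```

So the error itself is a symptom, not the cause: the interesting question is why the automatic placement decided no layer could go on the 12GB GPU in the first place.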



