Program runs fine on Mac but not on Linux (both CPU and GPU problems) #1998

@salman1993

Here is the link to my repo:
https://github.com/salman1993/simple-qa-on-kb/tree/master/clean/relation_prediction

The model code is very similar to the SNLI model example. The dataset may need to be downloaded separately, since I did not test the download part of the code.
Link to the dataset - https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz

The code runs fine on my Mac but fails on Linux, on both CPU and GPU. I am running PyTorch version '0.1.12_2'.
You can run the code with:

python train.py 
python train.py --cuda

This is the error I get on CPU:

$ python train.py
WARNING: You have CUDA but not using it.
root path for relation dataset: ../data
Namespace(batch_size=2, birnn=True, clip_gradient=0.5, cuda=False, d_embed=300, d_hidden=400, d_out=1838, data_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/data_cache', dev_every=1, dropout_prob=0.3, epochs=40, fix_emb=True, gpu=-1, log_every=1, lr=1e-05, n_cells=4, n_embed=74820, n_layers=2, patience=10, resume_snapshot='', save_every=100, save_path='saved_checkpoints', seed=1111, test=False, vector_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/vector_cache/input_vectors.pt', word_vectors='glove.42B')
  Time Epoch Iteration Progress    (%Epoch)   Loss   Dev/Loss     Accuracy  Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 100, in <module>
    answer = model(batch)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 50, in forward
    question_embed = self.embed(batch.question)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 94, in forward
    )(input, self.weight)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/sparse.py", line 53, in forward
    output = torch.index_select(weight, 0, indices.view(-1))
RuntimeError: index out of range at /py/conda-bld/pytorch_1493680494901/work/torch/lib/TH/generic/THTensorMath.c:273

On the GPU, the error is different:

$ python train.py --cuda
root path for relation dataset: ../data
Namespace(batch_size=2, birnn=True, clip_gradient=0.5, cuda=True, d_embed=300, d_hidden=400, d_out=1838, data_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/data_cache', dev_every=1, dropout_prob=0.3, epochs=40, fix_emb=True, gpu=0, log_every=1, lr=1e-05, n_cells=4, n_embed=74820, n_layers=2, patience=10, resume_snapshot='', save_every=100, save_path='saved_checkpoints', seed=1111, test=False, vector_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/vector_cache/input_vectors.pt', word_vectors='glove.42B')
  Time Epoch Iteration Progress    (%Epoch)   Loss   Dev/Loss     Accuracy  Dev/Accuracy
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "train.py", line 100, in <module>
    answer = model(batch)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 54, in forward
    question_encoded = self.encoder(question_embed)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 20, in forward
    outputs, (ht, ct) = self.rnn(inputs, (h0, c0))
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 91, in forward
    output, hidden = func(input, self.all_weights, hx)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 343, in forward
    return func(input, *fargs, **fkwargs)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 202, in _do_forward
    flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 224, in forward
    result = self.forward_extended(*nested_tensors)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 285, in forward_extended
    cudnn.rnn.forward(self, input, hx, weight, output, hy)
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 190, in forward
    handle = cudnn.get_handle()
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 262, in get_handle
    handle = CuDNNHandle()
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 81, in __init__
    check_error(lib.cudnnCreate(ctypes.byref(ptr)))
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 249, in check_error
    raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'
Exception ignored in: <bound method CuDNNHandle.__del__ of <torch.backends.cudnn.CuDNNHandle object at 0x7f540b0b8550>>
Traceback (most recent call last):
  File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 85, in __del__
    check_error(lib.cudnnDestroy(self))
ctypes.ArgumentError: argument 1: <class 'TypeError'>: Don't know how to convert parameter 1

I also tried uninstalling and reinstalling PyTorch, but that did not help.
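
The CPU traceback and the GPU device-side assertions both come from `index_select`, so one thing worth checking is whether any token index in a batch falls at or beyond the embedding table size (`n_embed=74820` in the Namespace output). Here is a minimal sketch of that check in plain Python, assuming the batch indices have already been flattened to a list of ints. `find_out_of_range` and the sample `token_ids` are hypothetical names made up for illustration and are not part of the repo:

```python
def find_out_of_range(indices, num_embeddings):
    """Return (position, value) pairs for indices outside [0, num_embeddings)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if idx < 0 or idx >= num_embeddings
    ]


# 74820 is the n_embed value printed in the Namespace output.
n_embed = 74820

# Hypothetical flattened batch of token ids, made up for this example.
token_ids = [12, 74819, 74820, 5, 80001]

bad = find_out_of_range(token_ids, n_embed)
print(bad)  # [(2, 74820), (4, 80001)]
```

An empty result would mean every index fits the table, and the problem lies elsewhere. How to get a flat list of ints out of `batch.question` depends on the PyTorch version, so that step is left out here.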
