Description
Here is the link to my repo:
https://github.com/salman1993/simple-qa-on-kb/tree/master/clean/relation_prediction
The model code is very similar to the SNLI model example. The dataset may need to be downloaded first; I did not test that part of the code.
Link to the dataset - https://www.dropbox.com/s/tohrsllcfy7rch4/SimpleQuestions_v2.tgz
The code runs fine on my Mac. However, it fails on Linux, on both CPU and GPU. I am running PyTorch version '0.1.12_2'.
You can run the code with:
python train.py
python train.py --cuda
This is the error I get on CPU:
$ python train.py
WARNING: You have CUDA but not using it.
root path for relation dataset: ../data
Namespace(batch_size=2, birnn=True, clip_gradient=0.5, cuda=False, d_embed=300, d_hidden=400, d_out=1838, data_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/data_cache', dev_every=1, dropout_prob=0.3, epochs=40, fix_emb=True, gpu=-1, log_every=1, lr=1e-05, n_cells=4, n_embed=74820, n_layers=2, patience=10, resume_snapshot='', save_every=100, save_path='saved_checkpoints', seed=1111, test=False, vector_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/vector_cache/input_vectors.pt', word_vectors='glove.42B')
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
File "train.py", line 100, in <module>
answer = model(batch)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 50, in forward
question_embed = self.embed(batch.question)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 94, in forward
)(input, self.weight)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/sparse.py", line 53, in forward
output = torch.index_select(weight, 0, indices.view(-1))
RuntimeError: index out of range at /py/conda-bld/pytorch_1493680494901/work/torch/lib/TH/generic/THTensorMath.c:273
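Since the CPU traceback ends in `torch.index_select` on the embedding weight, one thing worth checking is whether any token id in the batch falls outside the embedding table. Below is a minimal sketch of that check in plain Python. The helper `find_out_of_range` and the sample ids are hypothetical; only `n_embed=74820` is taken from the Namespace output above.

```python
def find_out_of_range(indices, num_embeddings):
    """Return (position, index) pairs for ids outside [0, num_embeddings)."""
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if idx < 0 or idx >= num_embeddings
    ]


# n_embed as printed in the Namespace; valid ids are 0 .. n_embed - 1.
n_embed = 74820

# Hypothetical flattened token ids for one batch (not real data).
flat_ids = [12, 74819, 74820, 5]

bad = find_out_of_range(flat_ids, n_embed)
print(bad)  # [(2, 74820)]
```

In the real script the ids would presumably come from `batch.question` flattened to a list, and the table size from the embedding layer rather than the command-line argument. That wiring is an assumption and is not shown here. If the check reports anything, the vocabulary built on Linux may be larger than the `n_embed` used to construct the model.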
On the GPU, the error is different:
$ python train.py --cuda
root path for relation dataset: ../data
Namespace(batch_size=2, birnn=True, clip_gradient=0.5, cuda=True, d_embed=300, d_hidden=400, d_out=1838, data_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/data_cache', dev_every=1, dropout_prob=0.3, epochs=40, fix_emb=True, gpu=0, log_every=1, lr=1e-05, n_cells=4, n_embed=74820, n_layers=2, patience=10, resume_snapshot='', save_every=100, save_path='saved_checkpoints', seed=1111, test=False, vector_cache='/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/vector_cache/input_vectors.pt', word_vectors='glove.42B')
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [51,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "train.py", line 100, in <module>
answer = model(batch)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 54, in forward
question_encoded = self.encoder(question_embed)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/u/s43mohammed/dev/simple-qa-on-kb/clean/relation_prediction/model.py", line 20, in forward
outputs, (ht, ct) = self.rnn(inputs, (h0, c0))
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 91, in forward
output, hidden = func(input, self.all_weights, hx)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 343, in forward
return func(input, *fargs, **fkwargs)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 202, in _do_forward
flat_output = super(NestedIOFunction, self)._do_forward(*flat_input)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 224, in forward
result = self.forward_extended(*nested_tensors)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 285, in forward_extended
cudnn.rnn.forward(self, input, hx, weight, output, hy)
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/rnn.py", line 190, in forward
handle = cudnn.get_handle()
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 262, in get_handle
handle = CuDNNHandle()
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 81, in __init__
check_error(lib.cudnnCreate(ctypes.byref(ptr)))
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 249, in check_error
raise CuDNNError(status)
torch.backends.cudnn.CuDNNError: 4: b'CUDNN_STATUS_INTERNAL_ERROR'
Exception ignored in: <bound method CuDNNHandle.__del__ of <torch.backends.cudnn.CuDNNHandle object at 0x7f540b0b8550>>
Traceback (most recent call last):
File "/u5/s43mohammed/anaconda3/lib/python3.6/site-packages/torch/backends/cudnn/__init__.py", line 85, in __del__
check_error(lib.cudnnDestroy(self))
ctypes.ArgumentError: argument 1: <class 'TypeError'>: Don't know how to convert parameter 1
I also tried uninstalling and reinstalling PyTorch, but that did not help.
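The `srcIndex < srcSelectDimSize` assertions in the GPU log come from the index-select kernel, so the GPU failure may well be the same out-of-range index as on the CPU, with the cuDNN error only showing up later because CUDA kernels are launched asynchronously. One way to test that guess is to force synchronous kernel launches with the CUDA environment variable `CUDA_LAUNCH_BLOCKING`, so that the Python traceback points at the call that actually failed. A minimal sketch, assuming the variable is set at the very top of `train.py` before CUDA is initialized:

```python
import os

# Assumption: this runs before torch initializes CUDA, so it is placed
# ahead of the torch import at the top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # the real script would import torch only after this point

print(os.environ["CUDA_LAUNCH_BLOCKING"])  # 1
```

Setting the variable in the shell before running `python train.py --cuda` should have the same effect. Synchronous launches are slower, so this is only meant for debugging.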