Issue
Currently the transform method of the BERTopic has to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call the transform method on a single document.
For example it would be nicer to do this:
document = 'This is a really interesting document'
document_embedding = np.random.rand(768)
topic_model.transform(document, document_embedding)
Example of what you currently have to do
document = 'This is a really interesting document'
document_embedding = np.random.rand(768)
topic_model.transform([document], document_embedding.reshape(1,-1))
Currently if you run it you get this error.
BFile ~/code/BERTopic/bertopic/_utils.py:55, in check_embeddings_shape(embeddings, docs)
else:
if embeddings.shape[0] != len(docs):
---> raise ValueError("Make sure that the embeddings are a numpy array with shape: "
"(len(docs), vector_dim) where vector_dim is the dimensionality "
"of the vector embeddings. ")
ValueError: Make sure that the embeddings are a numpy array with shape: (len(docs), vector_dim) where vector_dim is the dimensionality of the vector embeddings.
Why I think it is happening
This is because the embeddings shape check inside transform happens before:
- the
document argument is converted into a list if it is str
- The embedding is reshaped from a 1d array to 2d array (Note that it currently never does this).
Thoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.
I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?
Issue
Currently the
transformmethod of theBERTopichas to recieve a list of strings as the documents and if embeddings are present then it needs to be a 2darray of shape (len(documents, embedding_dimension). This requires extra reshape calls when trying to call thetransformmethod on a single document.For example it would be nicer to do this:
Example of what you currently have to do
Currently if you run it you get this error.
Why I think it is happening
This is because the embeddings shape check inside
transformhappens before:documentargument is converted into a list if it is strThoughts on making it better
Am I missing something here about np arrays or the transform function? It seems to me that wanting to transform a single document is common enough that the function could have that little bit of extra functionality.
I have already made the change on a fork for my uses did you agree with the idea and would you like a PR?