Embedding
This module contains classes and functions for converting words or text into numerical vectors.
WordEmbedding
This class includes functions for converting words into numerical vectors.
Examples:
>>> # Load from Hugging Face Hub
>>> wordEmbedding = WordEmbedding.load(repo_id='roshan-research/hazm-word-embedding', model_filename='fasttext_skipgram_300.bin', model_type='fasttext')
>>> # Or load from a local model file
>>> # wordEmbedding = WordEmbedding.load(model_path='fasttext_skipgram_300.bin', model_type='fasttext')
__getitem__(word)
Returns the vector for the given word.
__init__(model, model_type)
Constructor.
doesnt_match(words)
Finds the word that does not match the others in the list.
Examples:
>>> wordEmbedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره'])
'پنجره'
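A minimal sketch of one common way to pick the odd word out: average the unit-length vectors and return the word least aligned with that mean. Gensim's `doesnt_match` works along these lines, but whether this method delegates to it is an assumption. The helper names and the two-dimensional toy vectors are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(vectors):
    """Return the word whose unit vector is least aligned with the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # The lowest dot product with the mean marks the outlier.
    return min(units, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))


# Hypothetical toy vectors: three greeting-like words and one unrelated word.
toy_vectors = {
    "hello": [0.9, 0.1],
    "greetings": [0.8, 0.2],
    "goodbye": [0.7, 0.3],
    "window": [0.1, 0.9],
}
outlier = odd_one_out(toy_vectors)
print(outlier)
```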
get_normal_vector(word)
Returns the normalized vector for the given word.
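The page does not say which norm is used. Assuming "normalized" means scaled to unit L2 length, which is the usual convention for word vectors, the operation can be sketched as follows; `l2_normalize` is a hypothetical helper, not part of this module.

```python
import math


def l2_normalize(vector):
    """Return a copy of `vector` scaled to unit L2 length (assumed convention)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


normalized = l2_normalize([3.0, 4.0])
print(normalized)
```

With unit-length vectors, a plain dot product between two words equals their cosine similarity.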
get_vector_size()
Returns the size of the word vectors.
get_vectors()
Returns the matrix of word vectors.
get_vocab_to_index()
Returns a dictionary mapping words to their indices.
get_vocabs()
Returns the list of vocabulary words.
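The sketch below illustrates how a vocabulary list, a vector matrix, and a word-to-index dictionary typically fit together. It assumes that row `i` of the matrix belongs to the word at index `i`, which this page does not state explicitly; all data and the `lookup` helper are hypothetical.

```python
# Hypothetical stand-ins for the values returned by get_vocabs() and get_vectors().
vocab = ["سلام", "درود", "پنجره"]
vectors = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
]

# A word-to-index dictionary, comparable in shape to get_vocab_to_index().
vocab_to_index = {word: index for index, word in enumerate(vocab)}


def lookup(word):
    """Fetch the matrix row for `word` through the index mapping."""
    return vectors[vocab_to_index[word]]


print(lookup("درود"))
```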
load(model_path=None, model_type='fasttext', repo_id=None, model_filename=None)
classmethod
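Factory method that loads a model either from a local file (`model_path`) or from the Hugging Face Hub (`repo_id` and `model_filename`).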
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_path` | `str \| Path \| None` | Path to the model file. | `None` |
| `model_type` | `str` | Type of the model (`'fasttext'`, `'keyedvector'`, or `'glove'`). | `'fasttext'` |
| `repo_id` | `str \| None` | Hugging Face repository ID. | `None` |
| `model_filename` | `str \| None` | Filename in the Hugging Face repository. | `None` |
Returns:
| Type | Description |
|---|---|
| `WordEmbedding` | An instance of WordEmbedding. |
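The parameters suggest two ways to locate a model: a local path, or a repository ID plus filename. The sketch below shows one way such arguments could be resolved. `resolve_model_source` is a hypothetical helper, and the precedence (local path first) and the error raised are assumptions, not documented behavior of `load`.

```python
from pathlib import Path


def resolve_model_source(model_path=None, repo_id=None, model_filename=None):
    """Hypothetical helper: decide where a model file would be read from."""
    if model_path is not None:
        # Assumption: an explicit local path takes precedence.
        return ("local", Path(model_path))
    if repo_id is not None and model_filename is not None:
        return ("hub", repo_id, model_filename)
    raise ValueError("Pass either model_path, or both repo_id and model_filename.")


print(resolve_model_source(model_path="fasttext_skipgram_300.bin"))
print(resolve_model_source(
    repo_id="roshan-research/hazm-word-embedding",
    model_filename="fasttext_skipgram_300.bin",
))
```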
nearest_words(word, topn=5)
Finds the nearest words to the given word.
Examples:
>>> wordEmbedding.nearest_words('ایران', topn=5)
[('کشور', 0.8735059499740601), ...]
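The `(word, score)` pairs returned above are consistent with a ranking by similarity score. As a rough sketch of that idea, assuming cosine similarity is the score, the following ranks every other word in a tiny hypothetical vocabulary and keeps the top `topn`. The helper names and toy vectors are illustrative only.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, topn=5):
    """Return the `topn` (word, score) pairs most similar to `query`."""
    scored = (
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    )
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Hypothetical toy vectors.
toy = {
    "iran": [1.0, 0.0],
    "country": [0.9, 0.1],
    "city": [0.6, 0.4],
    "window": [0.0, 1.0],
}
result = nearest("iran", toy, topn=2)
print(result)
```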
similarity(word1, word2)
Calculates the similarity between two words.
Examples:
>>> wordEmbedding.similarity('ایران', 'آلمان')
0.72231203
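The page does not name the similarity measure. Word-embedding libraries conventionally use cosine similarity, so the sketch below assumes that; `cosine_similarity` is a hypothetical standalone function, not this module's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors `a` and `b`."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors at a 45 degree angle give a score of about 0.707.
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(score)
```

Under this measure, scores range from -1 (opposite directions) to 1 (same direction), and vector length has no effect on the result.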
train(dataset_path, workers=multiprocessing.cpu_count() - 1, vector_size=200, epochs=10, min_count=5, fasttext_type='skipgram', dest_path='fasttext_word2vec_model.model')
Trains the model using Gensim FastText.
Examples:
>>> wordEmbedding.train(dataset_path='dataset.txt', workers=4, vector_size=300, epochs=30, fasttext_type='cbow')
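Note that the default `workers=multiprocessing.cpu_count() - 1` evaluates to 0 on a single-CPU machine, which may not be a useful worker count. A cautious caller could compute the value explicitly and pass it in, as in this sketch; `safe_worker_count` is a hypothetical helper.

```python
import multiprocessing


def safe_worker_count(reserve=1):
    """Leave `reserve` CPUs free, but never return fewer than one worker."""
    return max(1, multiprocessing.cpu_count() - reserve)


workers = safe_worker_count()
print(workers)
```

The result can then be passed as `workers=workers` when calling `train`.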
SentEmbedding
Converts sentences to vectors.
Examples:
>>> # Load from Hugging Face Hub
>>> sentEmbedding = SentEmbedding.load(repo_id='roshan-research/hazm-sent-embedding', model_filename='sent2vec-naab.model')
>>> # Or load from a local model file
>>> # sentEmbedding = SentEmbedding.load(model_path='sent2vec-naab.model')
__init__(model=None)
Constructor.
get_sentence_vector(sent)
Returns the vector for the given sentence.
Examples:
>>> result = sentEmbedding.get_sentence_vector('این متن به برداری تبدیل خواهد شد')
get_vector_size()
Returns the size of the sentence vectors.
load(model_path=None, repo_id=None, model_filename=None)
classmethod
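Factory method that loads a model either from a local file (`model_path`) or from the Hugging Face Hub (`repo_id` and `model_filename`).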
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_path` | `str \| Path \| None` | Path to the model file. | `None` |
| `repo_id` | `str \| None` | Hugging Face repository ID. | `None` |
| `model_filename` | `str \| None` | Filename in the Hugging Face repository. | `None` |
Returns:
| Type | Description |
|---|---|
| `SentEmbedding` | An instance of SentEmbedding. |
similarity(sent1, sent2)
Calculates the similarity between two sentences.
Examples:
>>> result = sentEmbedding.similarity('شیر حیوانی وحشی است', 'پلنگ از دیگر جانوران درنده است')
train(dataset_path, min_count=5, workers=multiprocessing.cpu_count() - 1, windows=5, vector_size=300, epochs=10, dest_path='gensim_sent2vec.model')
Trains the model using Gensim Doc2Vec.
SentenceEmbeddingCorpus
Iterates over a dataset for Doc2Vec training.
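Doc2Vec training makes several passes over the data, so the corpus needs to be a restartable iterable rather than a one-shot generator. The sketch below shows that pattern under stated assumptions: one sentence per line, whitespace tokenization, and the line number as the tag. `LineCorpus` and `TaggedLine` are hypothetical names; `TaggedLine` stands in for Gensim's `TaggedDocument` so the example runs without Gensim installed.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument, which is also a
# namedtuple with `words` and `tags` fields.
TaggedLine = namedtuple("TaggedLine", "words tags")


class LineCorpus:
    """Hypothetical corpus: one sentence per line, whitespace-tokenized."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call makes the corpus restartable.
        with open(self.path, encoding="utf-8") as handle:
            for index, line in enumerate(handle):
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield TaggedLine(words=tokens, tags=[index])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("first toy sentence\n\nsecond toy sentence\n")
    corpus = LineCorpus(path)
    first_pass = list(corpus)
    second_pass = list(corpus)  # a second epoch sees the same documents

print(first_pass)
```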