Embedding

This module contains classes and functions for converting words or text into numerical vectors.

WordEmbedding

This class includes functions for converting words into numerical vectors.

Examples:

>>> # Load from Hugging Face Hub
>>> wordEmbedding = WordEmbedding.load(repo_id='roshan-research/hazm-word-embedding', model_filename='fasttext_skipgram_300.bin', model_type='fasttext')
>>> # Or load from a local model file
>>> # wordEmbedding = WordEmbedding.load(model_path='fasttext_skipgram_300.bin', model_type='fasttext')

__getitem__(word)

Returns the vector for the given word.

__init__(model, model_type)

Constructor.

doesnt_match(words)

Finds the word that does not match the others in the list.

Examples:

>>> wordEmbedding.doesnt_match(['سلام', 'درود', 'خداحافظ', 'پنجره'])
'پنجره'
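
This page does not describe how the odd word is chosen. One common approach, shown here as a minimal sketch on toy 2-dimensional vectors, is to unit-normalize every vector, average them into a centroid, and return the word whose vector has the smallest dot product with that centroid. The names `l2_normalize`, `odd_one_out`, and `toy_vectors` are hypothetical and are not part of this module.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group centroid."""
    units = [l2_normalize(vectors[word]) for word in words]
    dim = len(units[0])
    centroid = [sum(unit[i] for unit in units) / len(units) for i in range(dim)]
    scores = [sum(a * b for a, b in zip(unit, centroid)) for unit in units]
    return words[scores.index(min(scores))]

# Toy vectors standing in for real embeddings (assumed values, not model output).
toy_vectors = {
    "hello": [0.9, 0.1],
    "greetings": [0.8, 0.2],
    "goodbye": [0.7, 0.3],
    "window": [0.1, 0.9],
}

outlier = odd_one_out(["hello", "greetings", "goodbye", "window"], toy_vectors)
```

In this toy data the three greeting words point in roughly the same direction, so `window` is the least aligned with the centroid.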

get_normal_vector(word)

Returns the normalized vector for the given word.
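
The page does not say which norm is used. Assuming "normalized" means scaled to unit Euclidean (L2) length, which is the usual convention for word vectors, the operation could be sketched as follows. `l2_normalize` is a hypothetical helper, not part of this module.

```python
import math

def l2_normalize(vector):
    """Divide each component by the vector's Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]

# Toy vector with length 5, so the result has length 1.
unit = l2_normalize([3.0, 4.0])
unit_length = math.sqrt(sum(x * x for x in unit))
```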

get_vector_size()

Returns the size of the word vectors.

get_vectors()

Returns the matrix of word vectors.

get_vocab_to_index()

Returns a dictionary mapping words to their indices.
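
As an illustration of how a word-to-index mapping is typically used, the sketch below assumes that each index is the row position of that word in the matrix of word vectors. That relationship is an assumption, and all names and values here are toy stand-ins rather than output of this module.

```python
# Toy stand-ins for the vocabulary list and the matrix of word vectors.
vocabs = ["hello", "world", "window"]
vectors = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

# Build the word -> row index mapping from the vocabulary order.
vocab_to_index = {word: index for index, word in enumerate(vocabs)}

def lookup(word):
    """Fetch a word's vector through its row index (assumed layout)."""
    return vectors[vocab_to_index[word]]

row = lookup("world")
```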

get_vocabs()

Returns the list of vocabulary words.

load(model_path=None, model_type='fasttext', repo_id=None, model_filename=None) classmethod

Factory method that loads a model either from a local file or from the Hugging Face Hub.

Parameters:

Name              Type                  Description                                                   Default
model_path        str | Path | None     Path to the model file.                                       None
model_type        str                   Type of the model ('fasttext', 'keyedvector', or 'glove').    'fasttext'
repo_id           str | None            Hugging Face repository ID.                                   None
model_filename    str | None            Filename in the Hugging Face repository.                      None

Returns:

Type             Description
WordEmbedding    An instance of WordEmbedding.

nearest_words(word, topn=5)

Finds the nearest words to the given word.

Examples:

>>> wordEmbedding.nearest_words('ایران', topn=5)
[('کشور', 0.8735059499740601), ...]
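
The result is a list of `(word, score)` pairs. A minimal sketch of how such a ranking can be produced, assuming the score is cosine similarity and using toy vectors instead of a real model, is shown below. `cosine_similarity`, `rank_neighbours`, and `toy_vectors` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_neighbours(query, vectors, topn=5):
    """Score every other word against `query` and keep the `topn` best."""
    query_vec = vectors[query]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors (assumed values, not model output).
toy_vectors = {
    "tehran": [0.9, 0.1],
    "isfahan": [0.8, 0.2],
    "apple": [0.1, 0.9],
    "pear": [0.2, 0.8],
}

neighbours = rank_neighbours("tehran", toy_vectors, topn=2)
```

A real model would search a much larger vocabulary, usually with vectorized matrix operations rather than a Python loop.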

similarity(word1, word2)

Calculates the similarity between two words.

Examples:

>>> wordEmbedding.similarity('ایران', 'آلمان')
0.72231203
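
The page does not name the similarity measure. For word vectors it is commonly cosine similarity, and the sketch below assumes that, using toy 3-dimensional vectors rather than real embeddings. `cosine_similarity` is a hypothetical helper, not part of this module.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for word embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

score_parallel = cosine_similarity(vec_a, vec_b)    # close to 1.0
score_orthogonal = cosine_similarity(vec_a, vec_c)  # close to 0.0
```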

train(dataset_path, workers=multiprocessing.cpu_count() - 1, vector_size=200, epochs=10, min_count=5, fasttext_type='skipgram', dest_path='fasttext_word2vec_model.model')

Trains the model using Gensim FastText.

Examples:

>>> wordEmbedding.train(dataset_path='dataset.txt', workers=4, vector_size=300, epochs=30, fasttext_type='cbow')
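
The documented default for `workers` is `multiprocessing.cpu_count() - 1`, which evaluates to 0 on a single-core machine. How the trainer treats a value of 0 is not stated here, so a caller may prefer to pass an explicit value. The guard below is a hypothetical sketch, not something this module is known to do.

```python
import multiprocessing

# Mirrors the documented default expression for `workers`.
default_workers = multiprocessing.cpu_count() - 1

# Hypothetical guard: always keep at least one worker thread.
safe_workers = max(1, default_workers)
```

The resulting value could then be passed as `workers=safe_workers` when calling `train`.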

SentEmbedding

Converts sentences to vectors.

Examples:

>>> # Load from Hugging Face Hub
>>> sentEmbedding = SentEmbedding.load(repo_id='roshan-research/hazm-sent-embedding', model_filename='sent2vec-naab.model')
>>> # Or load from a local model file
>>> # sentEmbedding = SentEmbedding.load(model_path='sent2vec-naab.model')

__init__(model=None)

Constructor.

get_sentence_vector(sent)

Returns the vector for the given sentence.

Examples:

>>> result = sentEmbedding.get_sentence_vector('این متن به برداری تبدیل خواهد شد')

get_vector_size()

Returns the size of the sentence vectors.

load(model_path=None, repo_id=None, model_filename=None) classmethod

Factory method that loads a model either from a local file or from the Hugging Face Hub.

Parameters:

Name              Type                  Description                                 Default
model_path        str | Path | None     Path to the model file.                     None
repo_id           str | None            Hugging Face repository ID.                 None
model_filename    str | None            Filename in the Hugging Face repository.    None

Returns:

Type             Description
SentEmbedding    An instance of SentEmbedding.

similarity(sent1, sent2)

Calculates the similarity between two sentences.

Examples:

>>> result = sentEmbedding.similarity('شیر حیوانی وحشی است', 'پلنگ از دیگر جانوران درنده است')

train(dataset_path, min_count=5, workers=multiprocessing.cpu_count() - 1, windows=5, vector_size=300, epochs=10, dest_path='gensim_sent2vec.model')

Trains the model using Gensim Doc2Vec.

SentenceEmbeddingCorpus

Iterates over a dataset for Doc2Vec training.
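
To illustrate the general idea, the sketch below shows a re-iterable corpus that turns each non-empty line of a text file into one tagged document. It assumes one sentence per line and whitespace tokenization, neither of which is confirmed by this page. `TaggedLine` is a stand-in with the same two fields (`words`, `tags`) as gensim's `TaggedDocument`, and `LineCorpus` is a hypothetical class, not the implementation of `SentenceEmbeddingCorpus`.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument.
TaggedLine = namedtuple("TaggedLine", ["words", "tags"])

class LineCorpus:
    """Re-iterable corpus: the file is reopened on every pass."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            doc_id = 0
            for line in handle:
                words = line.split()
                if not words:
                    continue  # skip blank lines
                yield TaggedLine(words=words, tags=[doc_id])
                doc_id += 1

# Demonstration on a temporary file with two sentences and one blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("first sentence here\n\nsecond one\n")
    corpus = LineCorpus(path)
    first_pass = list(corpus)
    second_pass = list(corpus)
```

Using a class with `__iter__` rather than a one-shot generator matters because training typically makes several passes over the data, one per epoch.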