Author

Oren Bochman

Published

Friday, December 20, 2024

Recently I’ve been revisiting an old idea I had about creating a topological model of alignment. This was based on work on Wikipedia, where many hints are available (internal links, external links, cross-language links, Wikidata, citations, templates, etc.) and where texts are often translated, but more often missing or divergent. Over time I became aware of more resources that could be used: movie subtitles, book translations, parallel corpora. News articles and blogs, however, are often not parallel but can contain lots of similar information.

If we look at Twitter feeds, we can see that people often tweet the same news in different languages. This is a good source of parallel data. We could probably run it through a translation model and then use the output to learn alignment and segmentation. It should be even more useful if we capture sequences of tweets about the same news item. In that case we might also look at aligning or predicting emojis.

Ideally one would like to do unsupervised learning of alignment and segmentation: simply delete parts of one document and then try to predict the missing parts using the other document. The model would learn to do this better by learning to segment and align the documents.
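As a minimal sketch of this objective, here is one way such training pairs could be constructed; the sentinel convention is borrowed from T5-style span corruption, and the masking rate and span length are arbitrary placeholder values:

```python
import random

def make_training_example(src_tokens, trg_tokens, mask_prob=0.1, span_len=3):
    """Hide spans of one document and ask a model to reconstruct them,
    conditioned on the untouched other-language document. Sentinel
    tokens mark the gaps, T5-style."""
    corrupted, targets = [], []
    i, k = 0, 0
    while i < len(src_tokens):
        if random.random() < mask_prob:
            span = src_tokens[i:i + span_len]
            corrupted.append(f"<extra_id_{k}>")   # placeholder for the gap
            targets.append((f"<extra_id_{k}>", span))
            k += 1
            i += len(span)
        else:
            corrupted.append(src_tokens[i])
            i += 1
    # the model sees the corrupted document plus the full other document
    model_input = corrupted + ["<sep>"] + list(trg_tokens)
    return model_input, targets
```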

Another interesting idea is to learn an ancillary representation for alignment and segmentation for each language. This is an idea I got from my work on language evolution. Instead of trying to learn the whole grammar, we might try to model the most common short constructs in each language. With a suitable loss function we might find a pragmatic representation that is useful for alignment and segmentation for a language pair. Of course such representations would be useful for other tasks as well.

This might be much easier if we provide decent-sized chunks for training. We might also first use very similar documents (from a parallel corpus) and later move to news articles or papers that are more loosely related.

Segmentation and alignment are two related tasks that are often done together. In this abstract view they are more widely applicable than translation, e.g. to DNA sequences and time series. However, this post will focus primarily on translation.

I guess the algorithm would need to find a segment in the source and decide whether (a rough sketch in code follows the list):

  1. there is a similar segment in the other text.
  2. there are multiple segments that match (due to morphology, metaphor, or lack of a specific word in the target language).
  3. the segment is missing in the other text.
  4. a conflicting segment is present in the other text.
  5. the segment is a non-text segment (markup, templates, images, tables, etc.).
  6. the segment is a named entity or a place name that requires transliteration or a lookup in a ‘knowledge base’.
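As a rough sketch of this decision procedure, assuming a similarity function and detectors for markup and named entities are available (all of these are hypothetical placeholders), the case analysis could look like this:

```python
from enum import Enum, auto

class MatchType(Enum):
    SINGLE_MATCH = auto()    # one similar segment in the other text
    MULTIPLE_MATCH = auto()  # several candidates (morphology, metaphor, ...)
    MISSING = auto()         # no counterpart at all
    CONFLICT = auto()        # a contradicting segment is present
    NON_TEXT = auto()        # markup, templates, images, tables, ...
    NAMED_ENTITY = auto()    # needs transliteration / knowledge-base lookup

def classify_segment(segment, candidates, sim, is_markup, is_entity,
                     match_thr=0.75, conflict_thr=0.25):
    """Case analysis for one source segment. `sim(a, b)` is any
    similarity in [0, 1]; the thresholds are untuned placeholders."""
    if is_markup(segment):
        return MatchType.NON_TEXT
    if is_entity(segment):
        return MatchType.NAMED_ENTITY
    scores = [sim(segment, c) for c in candidates]
    hits = [s for s in scores if s >= match_thr]
    if len(hits) == 1:
        return MatchType.SINGLE_MATCH
    if len(hits) > 1:
        return MatchType.MULTIPLE_MATCH
    # mid-band similarity: same topic but disagreeing content
    if any(conflict_thr <= s < match_thr for s in scores):
        return MatchType.CONFLICT
    return MatchType.MISSING
```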

The original idea was to use these hints to align the documents at a rough level by providing a rough topology for each document. The open sets would be mappable to each other. They could then be concatenated to learn Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) representations.

Topologies can then be refined by using cross-language word models on the segments deemed to be similar.

One tool that is available today is cross-language word embeddings. These should allow us to align the documents at a much finer level.
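A minimal sketch of what this finer alignment could look like, assuming word vectors that already share a space (e.g. fastText’s aligned vectors, or any embeddings mapped into a common space with a Procrustes method such as MUSE); the greedy matching and the 0.5 threshold are arbitrary choices:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def align_words(src_words, trg_words, emb_src, emb_trg, threshold=0.5):
    """Greedy word alignment; emb_src / emb_trg map words to vectors
    that already live in one shared space."""
    links = []
    for i, sw in enumerate(src_words):
        if sw not in emb_src:
            continue  # OOV: names, places, etc. -- handled by other hints
        best_j, best_s = None, threshold
        for j, tw in enumerate(trg_words):
            if tw in emb_trg:
                s = cosine(emb_src[sw], emb_trg[tw])
                if s > best_s:
                    best_j, best_s = j, s
        if best_j is not None:
            links.append((i, best_j, best_s))
    return links
```

Greedy one-directional matching is crude; intersecting alignments from both directions or using an optimal-transport matcher would be better, but the point is that a shared embedding space gives a similarity signal below the sentence level.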

Word embeddings will often not be available for all words, such as names, places, etc. This is where the hints come in. A second tool that can help here is to learn transliteration models.
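A crude stand-in for such a model, just to make the idea concrete: romanize the unknown name with a toy character table and compare it to candidate target names with string similarity. The table below covers only a few Hebrew letters and is purely illustrative; a real transliteration model would be learned from matched name pairs:

```python
from difflib import SequenceMatcher

# toy romanization table for a handful of Hebrew letters -- illustrative only
ROMAN = {"א": "a", "ב": "b", "ג": "g", "ד": "d", "ל": "l", "מ": "m",
         "ם": "m", "נ": "n", "ן": "n", "ר": "r", "ש": "sh", "ת": "t"}

def romanize(word):
    return "".join(ROMAN.get(ch, "") for ch in word)

def transliteration_score(src_name, trg_name):
    """Crude stand-in for a learned transliteration model: romanize the
    source name and compare with string similarity."""
    return SequenceMatcher(None, romanize(src_name), trg_name.lower()).ratio()
```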

Another notion is to develop phrase embeddings. These could be used to better handle the one-to-many mappings that arise from differences in morphology between languages.

A further idea is that once we have alignments, we can learn pooling priors for different constructs and achieve better defaults for translation.

The phrase embeddings might combine a simple structural representation with a semantic representation. The structural representation would be used to align the phrases, and the semantic representation would be used to align the words within the phrases. The semantic representation would be grounded in the same high-dimensional semantic space as the word embeddings.
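A minimal sketch of such a combined phrase embedding; the structural pattern inventory here is a made-up placeholder (in practice it might be POS-tag sequences or learned construction ids), and mean pooling stands in for a proper semantic composition:

```python
import numpy as np

# a tiny closed set of structural patterns -- an assumption for this sketch
PATTERNS = ["NOUN-NOUN", "ADJ-NOUN", "VERB-NOUN", "PREP-NOUN", "OTHER"]

def phrase_embedding(words, pattern, emb, dim=300):
    """Concatenate a one-hot structural code with a mean-pooled semantic
    vector, so the semantic part stays in the word-embedding space."""
    structural = np.zeros(len(PATTERNS))
    structural[PATTERNS.index(pattern if pattern in PATTERNS else "OTHER")] = 1.0
    vecs = [emb[w] for w in words if w in emb]
    semantic = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([structural, semantic])
```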

Bitext and Alignment

A bitext B = (B_{src}, B_{trg}) is a pair of texts B_{src} and B_{trg} that correspond to each other.


B_{src} = (s_1, ..., s_N) and B_{trg} = (t_1, ..., t_M)

Empty elements can be added to the source and target sentences to allow for empty alignments corresponding to deletions/insertions.


A bisegment (p || r) pairs a source segment with a target segment:

(p || r) = (s_{x_1}, ..., s_{x_I}) || (t_{y_1}, ..., t_{y_J}) with 1 ≤ x_i ≤ N for all i = 1..I and 1 ≤ y_j ≤ M for all j = 1..J

An alignment A is then the set of bisegments for the entire bitext.

Ideally the alignment would be a bijection, but that is not always the case.

Bitext links L = (l_1, ..., l_K) describe such mappings between elements s_x and t_y: l_k = (x, y) with 1 ≤ x ≤ N and 1 ≤ y ≤ M for all k = 1..K. The set of links can also be referred to as a bitext map that aligns bitext positions with each other. Such a bitext map can then be used to induce an alignment A in the original sense.


Extracting bisegments from this bitext map can be seen as the task of merging text elements in such a way that the resulting segments can be mapped one-to-one without violating any connection.

Text linking
Find all connections between text elements from the source and the target text according to some constraints and conditions which together describe the correspondence relation of the two texts. The link structure is called a bitext map and may be used to extract bisegments.
Bisegmentation
Find source and target text segmentations such that there is a one-to-one mapping between corresponding segments.
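A small sketch of this merging step, treating links as edges of a bipartite graph and taking connected components with a union-find; unlinked positions become empty alignments:

```python
def extract_bisegments(links, N, M):
    """Merge source/target elements so the merged segments map one-to-one
    without violating any link. Links are (x, y) pairs with 1 <= x <= N
    and 1 <= y <= M."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    for x, y in links:                     # each link glues s_x and t_y
        union(("s", x), ("t", y))
    groups = {}
    for x, y in links:
        root = find(("s", x))
        src, trg = groups.setdefault(root, (set(), set()))
        src.add(x)
        trg.add(y)
    bisegments = [(sorted(src), sorted(trg)) for src, trg in groups.values()]
    # positions that occur in no link align with the empty segment
    linked_s = {x for x, _ in links}
    linked_t = {y for _, y in links}
    bisegments += [([x], []) for x in range(1, N + 1) if x not in linked_s]
    bisegments += [([], [y]) for y in range(1, M + 1) if y not in linked_t]
    return bisegments
```

For example, the links {(1, 1), (1, 2), (2, 3)} over N = 3, M = 3 yield the bisegments ([1], [1, 2]) and ([2], [3]), plus the empty alignment ([3], []).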

Segmentation
is the task of dividing a text into segments. Segmentation can be done at different levels of granularity, such as word, phrase, sentence, paragraph, or document level.

To successfully align two texts, the segments should be of the same granularity.

It is often frustrating to align Hebrew text, with its rich morphology, to English, because one Hebrew word frequently corresponds to several English words. Annotators will then segment a Hebrew word so that a single letter may form its own segment corresponding to an English word, e.g. a particle.
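To make this concrete, here is a toy prefix splitter for Hebrew; the prefix table and glosses are simplifications, and a real segmenter would need a lexicon to block false splits:

```python
# common Hebrew prefixes with rough English glosses -- simplifications
PREFIXES = [("וכש", "and when"), ("כש", "when"), ("ו", "and"),
            ("ה", "the"), ("ב", "in"), ("ל", "to"), ("מ", "from"),
            ("ש", "that")]

def split_prefix(word):
    """Split off at most one (longest-matching) prefix, e.g.
    'וכשהלכנו' -> ('וכש', 'הלכנו'): 'and when' + 'we went'.
    A real segmenter needs a lexicon to block false splits such as
    reading the initial ה of 'הלכנו' as the definite article."""
    for p, gloss in sorted(PREFIXES, key=lambda t: -len(t[0])):
        if word.startswith(p) and len(word) > len(p) + 1:
            return (p, gloss), word[len(p):]
    return None, word
```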

Different granularities of segmentation are:

  • morpheme (sub-word semantic segmentation)
  • character segmentation
  • word segmentation
  • token segmentation
  • lemma segmentation (token clusters)
  • n-gram segmentation
  • phrase segmentation
  • sentence segmentation
  • paragraph segmentation
  • syntactic constituent segmentation

Basic entropy and statistical tools should be useful here to identify and learn good segmentations for the different languages, and possibly how to align them, i.e. where morpheme boundaries lie and where clause/phrase boundaries lie.
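One classic tool of this kind is Harris-style successor entropy: at each position in a word, measure the entropy of the next character over all corpus words sharing that prefix, and treat entropy peaks as candidate morpheme boundaries. A minimal sketch:

```python
import math
from collections import Counter

def successor_entropies(corpus_words, word):
    """At each split point of `word`, compute the entropy of the next
    character over all corpus words sharing that prefix. Entropy peaks
    relative to neighbouring positions suggest morpheme boundaries,
    e.g. after 'play' in 'playing' vs 'played' vs 'plays'."""
    entropies = []
    for i in range(1, len(word)):
        prefix = word[:i]
        nxt = Counter(w[i] for w in corpus_words
                      if w.startswith(prefix) and len(w) > i)
        total = sum(nxt.values())
        h = (-sum((c / total) * math.log2(c / total) for c in nxt.values())
             if total else 0.0)
        entropies.append((prefix, h))
    return entropies
```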

This is where another idea comes in: some advanced time-series (TS) models can capture local behavior as well as long-term behavior in a single model.

look into:

Citation

BibTeX citation:
@online{bochman2024,
  author = {Bochman, Oren},
  date = {2024-12-20},
  url = {https://orenbochman.github.io/posts/2024/2024-10-01-alignment.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2024. December 20, 2024. https://orenbochman.github.io/posts/2024/2024-10-01-alignment.html.