yalign

A sentence aligner for comparable corpora

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

`Statistical Machine Translation <http://en.wikipedia.org/wiki/Statistical_machine_translation>`_ relies on `parallel corpora <http://en.wikipedia.org/wiki/Parallel_text>`_ (e.g. `europarl <http://www.statmt.org/europarl/>`_) for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in `comparable corpora <http://www.statmt.org/survey/Topic/ComparableCorpora>`_. This opens up avenues for harvesting parallel corpora from sources such as translated documents and the web.
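
To make the idea of "close translation matches" concrete, here is a toy illustration in Python. It is not Yalign's algorithm: it assumes a tiny hand-written bilingual dictionary (``TOY_DICTIONARY``), scores sentence pairs by word overlap, and keeps only pairs above an arbitrary threshold. Every name and value in it is invented for the example.

```python
# Toy illustration only; this is NOT how Yalign scores or aligns sentences.
# The dictionary, the scoring rule and the threshold are all assumptions.
TOY_DICTIONARY = {
    "the": {"el", "la"},
    "cat": {"gato"},
    "sleeps": {"duerme"},
    "dog": {"perro"},
    "runs": {"corre"},
}


def overlap_score(source_sentence, target_sentence):
    """Fraction of source words with a dictionary translation in the target."""
    source_words = source_sentence.lower().split()
    target_words = set(target_sentence.lower().split())
    if not source_words:
        return 0.0
    hits = sum(
        1 for word in source_words
        if TOY_DICTIONARY.get(word, set()) & target_words
    )
    return hits / len(source_words)


def extract_pairs(source_sentences, target_sentences, threshold=0.6):
    """Pair each source sentence with its best-scoring target, if good enough."""
    pairs = []
    for source in source_sentences:
        best = max(
            target_sentences,
            key=lambda target: overlap_score(source, target),
            default=None,
        )
        if best is not None and overlap_score(source, best) >= threshold:
            pairs.append((source, best))
    return pairs


# Two "comparable" documents: same topic, different order, some unmatched lines.
english = ["The cat sleeps", "The weather is nice", "The dog runs"]
spanish = ["El perro corre", "Hoy es lunes", "El gato duerme"]
print(extract_pairs(english, spanish))
```

The unmatched sentences ("The weather is nice", "Hoy es lunes") fall below the threshold and are dropped, which is the essential difference between aligning comparable corpora and aligning strictly parallel ones.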

Installation

Yalign requires `scikit-learn <http://scikit-learn.org/stable/install.html>`_, which you need to install first.

After that, you can install Yalign from PyPI via pip:

::

    sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

::

    wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
    tar -xvzf en-es.tar.gz
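
Where ``wget`` or ``tar`` are not available, the same download-and-unpack step can be sketched with Python's standard library. This is a minimal sketch and not part of Yalign: the helper names ``download_model`` and ``unpack_model`` are hypothetical, and the URL is the one used in the commands above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same archive URL as in the wget command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download_model(url=MODEL_URL, archive_path="en-es.tar.gz"):
    """Fetch the model archive to archive_path (hypothetical helper)."""
    urllib.request.urlretrieve(url, archive_path)
    return Path(archive_path)


def unpack_model(archive_path, dest="."):
    """Extract a .tar.gz archive into dest (hypothetical helper).

    Only use this on archives from a source you trust.
    """
    dest = Path(dest)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return dest
```

Calling ``unpack_model(download_model())`` from the working directory should leave the unpacked model next to the archive, as the shell commands do.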

Now we can use the yalign-align script along with the English-to-Spanish model to align two web pages.

::

    yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
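
To drive the same command from a Python script, one option is to shell out to it. The sketch below assumes ``yalign-align`` is installed and on the ``PATH``, and that it takes exactly the three positional arguments shown above. The function names are hypothetical, and this does not use Yalign's Python API.

```python
import subprocess


def build_align_command(model_dir, source, target):
    """Return the argument list for the yalign-align call shown above."""
    return ["yalign-align", model_dir, source, target]


def run_align(model_dir, source, target):
    """Run yalign-align and return whatever it prints to standard output.

    Assumes the script is on the PATH. Raises subprocess.CalledProcessError
    if the command exits with a non-zero status.
    """
    command = build_align_command(model_dir, source, target)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, ``run_align("en-es", "http://en.wikipedia.org/wiki/Antiparticle", "http://es.wikipedia.org/wiki/Antipart%C3%ADcula")`` would issue the same call as the command line above.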

Yalign is not limited to any one language pair. By creating your own models, you can align any two languages. For more details on how to use Yalign and on its implementation, please read the `docs <http://yalign.readthedocs.org/>`_.

The Yalign Team:

Yalign is a `Machinalis <http://www.machinalis.com>`_ project. You can view our other open source contributions `here <https://github.com/machinalis/>`_.

| Andrew Vine
| Gonzalo García Berrotarán
| Rafael Carrascosa
| Elías Andrawos
| Laura Alonso Alemany
