First steps with CafeTran Espress 2015 (video)
Thread poster: Dominique Pivard
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 18:13
No, it doesn't Jun 23, 2015

Samuel Murray wrote:

Meta Arkadia wrote:
And of course, CafeTran also offers that feature, unfortunately called "Prefix matching" in the Options menu.


It would appear then that CafeTran would automatically match "house" to "houses", but will it also match "woman" to "women" and "mouse" to "mice"?


No, it cannot do that nice trick.

Igor wrote that last summer he had been experimenting with the same tokenizer engine that OT uses. For reasons that I've forgotten, he didn't implement it.


 
Meta Arkadia
Meta Arkadia
Local time: 00:13
English to Indonesian
+ ...
Nope Jun 23, 2015

Samuel Murray wrote: ...but will it also match "woman" to "women" and "mouse"

Of course not. This feature only suggests terms based on the first few letters.



So "woman/women" would trigger a match if you set the prefix length to 3, "mouse/mice" if you set it to 1. The trouble is that it would then trigger so many matches that it's useless. I may give this prefix matching another try with German as the source language, and a rather high (6?) matching threshold. OT's tokeniser is in another league, as you know perfectly well. It's rule-based. Unfortunately, I don't seem to be able to make OT work for me, which is a pity, because I find the tokeniser concept intriguing.
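To make the threshold trade-off concrete, here is a minimal sketch of matching on the first few letters. It is only an illustration of the idea as described in this thread, not CafeTran's actual code; the function name `prefix_matches` and the sample term list are made up.

```python
def prefix_matches(word, terms, prefix_len):
    """Return the terms that share their first `prefix_len` letters with `word`."""
    key = word.lower()[:prefix_len]
    if len(key) < prefix_len:
        # The word is shorter than the threshold, so nothing can match.
        return []
    return [t for t in terms if t.lower().startswith(key)]


# Hypothetical termbase entries, chosen to show the noise at low thresholds.
terms = ["woman", "mouse", "house", "wombat", "womb", "machine", "meter", "module"]

print(prefix_matches("women", terms, 3))   # "woman" is found, but so are "wombat" and "womb"
print(prefix_matches("mice", terms, 1))    # "mouse" is found, along with every other m-word
print(prefix_matches("houses", terms, 5))  # a high threshold gives a clean hit on "house"
```

With a threshold of 1 the hit for "mouse" is buried among unrelated terms, which is the "so many matches that it's useless" problem; with a threshold of 5 or 6 the matches are clean, but irregular plurals are no longer found at all.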


Cheers,

Hans


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
Look away! Jun 23, 2015

Meta Arkadia wrote:

esperantisto wrote: And you don’t need to monkey around with the pipe symbol


And of course, CafeTran also offers that feature, unfortunately called "Prefix matching" in the Options menu. I tested it for English (yes, the animated GIF is way too fast, sorry)...



... and it works as advertised, but in real life, results were disappointing. English isn't the ideal source language for this feature, I'm afraid. I already challenged our CATGuru to try the feature for Finnish, but so far, he hasn't responded.

Cheers,

Hans


That's probably because your GIF scared him off.

Michael

PS: For about 3 percent of people with epilepsy, exposure to flashing lights at certain intensities or to certain visual patterns can trigger seizures.

[Edited at 2015-06-23 11:16 GMT]



 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
fuzzy magic = fuzzy mistakes? Jun 23, 2015

Samuel Murray wrote:

Meta Arkadia wrote:
And of course, CafeTran also offers that feature, unfortunately called "Prefix matching" in the Options menu.


It would appear then that CafeTran would automatically match "house" to "houses", but will it also match "woman" to "women" and "mouse" to "mice"?


[Edited at 2015-06-23 08:56 GMT]


That's exactly what I don't understand about all that fuzzy magic. It would seem to constantly be giving you a "fuzzy" translation. That's one reason why I prefer the precise matching of the tab-delimited glossaries. A house doesn't turn into a houses.

Michael


 
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 18:13
A better name for prefix matching? Jun 23, 2015

Meta Arkadia wrote:
And of course, CafeTran also offers that feature, unfortunately called "Prefix matching" in the Options menu.


So what would possibly be a better name for the feature known as 'prefix matching', Hans?

http://cafetran.wikidot.com/prefix-matching

Stemming?


 
Meta Arkadia
Meta Arkadia
Local time: 00:13
English to Indonesian
+ ...
Beats me Jun 23, 2015

2nl wrote: So what would possibly be a better name for the feature known as 'prefix matching'


I don't know. I'm a mere translator, not a terminologist.

Prefix matching is wrong, because "prefix" is a reserved linguistic term. Worse, the feature will hardly work for languages that use prefixes. For example, in Bahasa Indonesia, the passive is created by the prefix di. The passive of the verb pukul is dipukul. Having pukul in your termbase is useless for this feature, and depending on your settings, dipukul will probably trigger too many hits because there are heaps of words that start with di (all passive verbs), somewhat fewer that start with dip (all passive verbs that start with a "p"), and so on. You're probably better off adding all conjugations and declensions to your termbase, although it may work for long words with the length set very high. We should test this (for German as an SL, say 6+ letters?).
A tokeniser should recognise dipukul as a conjugation of pukul, however.
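The dipukul case can be sketched in a few lines. This is only an illustration under stated assumptions: `shares_prefix` stands in for first-letters matching as described in this thread, and `strip_passive` is a deliberately naive, hypothetical rule that knows nothing except the passive prefix di. It is not how OT's tokeniser or CafeTran actually work.

```python
def shares_prefix(a, b, n):
    """True if both words are at least n letters long and start with the same n letters."""
    return len(a) >= n and len(b) >= n and a[:n].lower() == b[:n].lower()


PASSIVE_PREFIX = "di"  # the single rule this toy example knows about


def strip_passive(word):
    """Toy rule: drop a leading 'di' when at least three letters remain."""
    if word.startswith(PASSIVE_PREFIX) and len(word) >= len(PASSIVE_PREFIX) + 3:
        return word[len(PASSIVE_PREFIX):]
    return word


print(shares_prefix("dipukul", "pukul", 3))    # False: the termbase entry is never found
print(shares_prefix("dipukul", "dipakai", 3))  # True: an unrelated passive verb matches
print(strip_passive("dipukul") == "pukul")     # True: a rule about 'di' recovers the entry
```

A real rule-based tokeniser would need far more than this, because the toy rule also strips di from words where it is not a prefix at all.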

Stemming is wrong, because the matching is not necessarily based on the stem of a word.

Partial word match?

Cheers,

Hans


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
"stemming" vs "prefix matching" vs "partial word matching" vs ??? Jun 24, 2015

Meta Arkadia wrote:

2nl wrote: So what would possibly be a better name for the feature known as 'prefix matching'


I don't know. I'm a mere translator, not a terminologist.

Prefix matching is wrong, because "prefix" is a reserved linguistic term. Worse, the feature will hardly work for languages that use prefixes. For example, in Bahasa Indonesia, the passive is created by the prefix di. The passive of the verb pukul is dipukul. Having pukul in your termbase is useless for this feature, and depending on your settings, dipukul will probably trigger too many hits because there are heaps of words that start with di (all passive verbs), somewhat fewer that start with dip (all passive verbs that start with a "p"), and so on. You're probably better off adding all conjugations and declensions to your termbase, although it may work for long words with the length set very high. We should test this (for German as an SL, say 6+ letters?).
A tokeniser should recognise dipukul as a conjugation of pukul, however.

Stemming is wrong, because the matching is not necessarily based on the stem of a word.

Partial word match?

Cheers,

Hans


You said, "Stemming is wrong, because the matching is not necessarily based on the stem of a word.", but is this true? Can you elaborate on that statement?

I just had a brief look at this page: https://en.wikipedia.org/wiki/Stemming … and the term "stemming" seems quite apt, not to mention it seems to be used in our industry to mean exactly what we are talking about.

"Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem needs not to be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Stemming programs are commonly referred to as stemming algorithms or stemmers.

[…]

Language challenges

While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.[5][6][7][8][9]

Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language becomes more complex. For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex (more noun declensions), a Hebrew one is even more complex (due to nonconcatenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on."
(https://en.wikipedia.org/wiki/Stemming )
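To illustrate the conflation idea from the quote, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm and not anything CafeTran or OT does; the suffix list and the names `crude_stem` and `conflate` are assumptions made up for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order; purely illustrative


def crude_stem(word):
    """Strip one common English suffix, then a trailing 'e', keeping at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    if w.endswith("e") and len(w) > 3:
        w = w[:-1]
    return w


def conflate(words):
    """Group words that map to the same crude stem."""
    groups = {}
    for w in words:
        groups.setdefault(crude_stem(w), []).append(w)
    return groups


print(conflate(["house", "houses", "walked", "walking", "woman", "women"]))
```

"house" and "houses" both end up under "hous", which is not a valid word; that is the point made in the quote about the stem not having to be a real root. "woman" and "women" stay in separate groups, so irregular forms need more than suffix stripping.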

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 00:13
English to Indonesian
+ ...
The stem of all evil Jun 24, 2015

Michael Beijer wrote:
You said, "Stemming is wrong, because the matching is not necessarily based on the stem of a word.", but is this true? Can you elaborate on that statement?


[in my own words] The stem of a word is the basic form, without any prefixes/suffixes/affixes. Not to be confused with root.
I think my Indonesian example above, dipukul, says it all: The stem is pukul, CT's feature will start searching for d, not for p, the first letter of the stem. An example for English: unacceptable. The feature doesn't start looking for the stem "accept," but for "u."
Wrong term.

As I mentioned above, it could be a useful feature for German or Dutch, if you set the threshold for the first few letters very high. You should get hits for all those compounds. For English, I couldn't make it work. It may also be a bit out of date, superseded by Auto-Complete, which - I think - works in the same way, but without entering a hit in the target language pane (if Prefix Matching is set that way).

[Edit] Oops: "Auto-Complete" I still think it's the same process, but for the source language. So it's not "superseded" by AC at all.

Cheers,

Hans

[Edited at 2015-06-24 01:57 GMT]


 
Meta Arkadia
Meta Arkadia
Local time: 00:13
English to Indonesian
+ ...
Just for kicks Jun 24, 2015

the stem "accept"


I think we can safely say the stem is "accept" in English, but the "ac" part of it is originally a prefix, whereas the Indo-European root of "acceptable" is KAP. (Dictionnaire des racines des langues européennes, Larousse, Paris 1948)

Cheers,

Hans


 
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 18:13
So perhaps we should start calling this feature 'Stemming'? Jun 24, 2015

Thanks for your quote, Michael!

So perhaps we should start calling this feature 'Stemming' and ask Igor to change its name in Edit > Options > Memory?



 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 18:13
Member (2006)
English to Afrikaans
+ ...
It's "partial word matching", that's all Jun 24, 2015

2nl wrote:
So perhaps we should start calling this feature 'Stemming' and ask Igor to change its name in Edit > Options > Memory?


The feature has little in common with stemming, since the feature only matches partial words from left to right. It's really just "partial word matching". The fact that it can't match from right to left doesn't have to be mentioned in the UI. I agree that calling it "prefix matching" is very confusing, because it doesn't match prefixes -- it attempts to match words without their suffixes. But "unsuffixed matching" doesn't ring right.

There must be some programmatic reason why matching is only from left to right, because WFC has the same deficiency.


[Edited at 2015-06-24 10:42 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
not quite stemming Jun 24, 2015

2nl wrote:

Thanks for your quote, Michael!

So perhaps we should start calling this feature 'Stemming' and ask Igor to change its name in Edit > Options > Memory?



Hi Hans,

It's not quite stemming either (as that involves "stems" and involves linguistic intelligence). Am still trying to think of the best term.

Michael

[Edited at 2015-06-24 12:05 GMT]



 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
Hmm Jun 24, 2015

Samuel Murray wrote:

2nl wrote:
So perhaps we should start calling this feature 'Stemming' and ask Igor to change its name in Edit > Options > Memory?


The feature has little in common with stemming, since the feature only matches partial words from left to right. It's really just "partial word matching". The fact that it can't match from right to left doesn't have to be mentioned in the UI. I agree that calling it "prefix matching" is very confusing, because it doesn't match prefixes -- it attempts to match words without their suffixes. But "unsuffixed matching" doesn't ring right.

There must be some programmatic reason why matching is only from left to right, because WFC has the same deficiency.


[Edited at 2015-06-24 10:42 GMT]


What it has in common with the term "stemming" is that people who know the term "stemming", when used in relation to CAT tools, might think of what the so-called "prefix matching" feature in CafeTran actually does. I suppose this will depend on their level of linguistic knowledge – the more they have of it, the less likely they will be to understand CT's term.

So stemming isn't quite right, as it involves "stems" (a specific linguistic term; see: https://en.wikipedia.org/wiki/Word_stem ) and (usually) involves linguistic intelligence.

Indeed, the prefix matching feature in CT "matches partial words from left to right". However, I don't agree with your statements "The fact that it can't match from right to left doesn't have to be mentioned in the UI." and "It's really just 'partial word matching'". I think the fact that it only matches the part of a word that is on the LEFT is in fact very important to know, and leaving this out of the term will only cause confusion.

Since the term "prefix" seems to be out of the question too, as it also has a specific linguistic meaning ("an element placed at the beginning of a word to adjust or qualify its meaning" –Oxforddictionaries.com), what are we then to call: the beginning of any word, but devoid of any specific linguistic function or meaning?

Possible ideas for names:
– "pseudo-stemming"
– "non-linguistic stemming"
– "poor man's stemming"
– "Stemming Lite"
– "word-start matching"
– "word-beginning matching"
– "suffix unmatching" (this one's a joke)
– "the word cutter"
– "partial word matching" (sounds good, but is missing the left-to-right aspect)
– "(left) word-part matching"
– "left-to-right matching"
– …

* see also:

"Prefix: A prefix is placed at the beginning of a word to modify or change its meaning. Pre means "before." Prefixes may also indicate a location, number, or time.
Root: central part of a word.
Suffix: The ending part of a word that modifies the meaning of the word. Example: homeless. Root = 'home' and the suffix is 'less'. It can also refer to a condition, disease, disorder, or procedure. " (http://www.globalrph.com/medterm.htm )



[Edited at 2015-06-24 12:08 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 17:13
Member (2009)
Dutch to English
+ ...
food for thought Jun 24, 2015

esperantisto wrote:

2nl wrote:

…CafeTran … so packed with useful features .


Sorry to say that, but I haven’t seen that in the video.


OK, as promised, here is some more info re all the cool stuff CT can do (copy/pasted from the big thread going on at the moment in the [help] Yahoo list @ https://groups.yahoo.com/neo/groups/help_/conversations/topics/46740):

#########################################
#########################################
Michael:

It is indeed true that in CT you can add all kinds of great stuff to the tab-delimited txt glossaries, such as

(1) regular expressions (this feature is amazingly useful; you can use regex right inside glossary entries!)
(2) pipe characters (for stemming or, if placed at the start of a glossary entry, to indicate that it contains regexes)
(3) source-side synonyms + target-side synonyms (something DVX really could learn from!!! in CT, a glossary can have synonyms in it. just separate individual synonyms with a semi colon)
(4) reference material (in the form of a metadata column for: Notes, Usage example, Source/Origin, etc.)
(5) hyperlinks (which are clickable, to take you straight to a Wikipedia page with info on the relevant term, e.g.)

As you can see, CT's txt glossaries are almost as powerful as a full-blown MultiTerm termbase, but with the added power of regular expressions, plus they are in a cross-platform, UTF-8 text format that can be easily edited and maintained in a CSV or text editor (even on-the-fly while translating).

Over and out,

Michael

--------------------------------------------------

Hi Max,

It is indeed extremely exciting, and lots of fun. Working with my big CafeTran glossary is a joy. Similar to the regex Hans mentioned, I added around 25 different ones in my job last night, pieces of a solar power system manual.

My main glossary is a tab-delimited UTF-8 text file, which I choose to save as ".csv" file instead of the default ".txt", so it opens in Ron's CSV Editor (HIGHLY recommended to anyone on Windows who works with tab-delimited terminology files!) when double-clicked, or when I right-click and select "Edit glossary" from inside CafeTran when translating. The reason Ron's editor (or a decent UTF-8 aware CSV editor) is so great is that it allows you to see the contents of your tab-del file as if it were in Excel, that is, in nice visually clear rows & columns (this is invaluable, imo) and perform multi-level filtering: i.e., first filter on all entries with a certain Client field, and then narrow this down to all entries containing a regular expression. VERY handy for TB maintenance tasks!

Here are a few regexes I added yesterday:

.+? = any string
\d+ = a number
.? = any character, once or not at all
----------------------------------
#nl-NL [TAB] #en-GB
|ongeveer €\d+ [TAB] approximately €000
|conform artikel \d+ [TAB] in accordance with Article 00
|max. vermogen gedurende \d+ minuten [TAB] max. power for 00 minutes
|het uitgangsvermogen van de .+? aanpassen [TAB] adjust the output power of the XXX
|Kan de .+? worden gebruikt in België? [TAB] Can the XXX be used in Belgium?
|Is het verplicht een .+? te plaatsen [TAB] Is it mandatory to install a XXX?
|Is het verplicht om een .+? te plaatsen [TAB] Is it mandatory to install a XXX?
|Werkt het .+? als .+? [TAB] Can the XXX be used as a XXX?
|\d+,\d+ kWh [TAB] 00,000 kWh
|batterijspanning van \d+ V [TAB] battery voltage of 00 V
|\d+.?\d+ kWh [TAB] 0.0 kWh
|\d+ W [TAB] 000 W
|controleren of .+? is inbegrepen [TAB] check whether XXX is included
|Kan ik de .+? rechtstreeks aankopen bij .+? [TAB] Can I purchase the XXX directly from XXX?
|Werkt mijn .+? in back-up modus? [TAB] will my XXX work in backup mode?
|ongeveer €\d+ [TAB] approximately €600
|wanneer ik geen .+? heb [TAB] if I don’t have a XXX
|\d+u [TAB] 0 hours
|\d+ kW [TAB] 00 kW
|Kan ik een .+? gebruiken [TAB] Can I use an XXX?
|indien u de .+? in back-up modus wilt gebruiken [TAB] if you wish to use the XXX in backup mode
|\d+kVA [TAB] 5kVA


Not entirely sure about spaces between stuff like kVA, kW and numbers, but you get the picture! These little gems can save you a lot of work, not to mention it's just fun when you create one and you get to see it work in your next job.
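For anyone who wants to try such entries outside CafeTran first, here is a rough sketch of how regex glossary lines could be read and tested. It assumes a tab-delimited file in which a leading pipe marks the source side as a regular expression, as described above; how CafeTran itself compiles the entries or fills numbers into the target is not covered here. The names `load_glossary` and `lookup` and the plain "omvormer" entry are made up for the example.

```python
import re

# Two entries from the list above plus one hypothetical plain (non-regex) entry.
ENTRIES = [
    ("#nl-NL", "#en-GB"),
    (r"|ongeveer €\d+", "approximately €000"),
    (r"|batterijspanning van \d+ V", "battery voltage of 00 V"),
    ("omvormer", "inverter"),
]
GLOSSARY_TEXT = "\n".join(f"{src}\t{tgt}" for src, tgt in ENTRIES)


def load_glossary(text):
    """Parse tab-delimited lines into (compiled pattern, target) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        source, target = line.split("\t")[:2]
        if source.startswith("|"):
            pattern = re.compile(source[1:])          # leading pipe: treat as regex
        else:
            pattern = re.compile(re.escape(source))   # plain term: match literally
        entries.append((pattern, target))
    return entries


def lookup(segment, entries):
    """Return (matched source text, glossary target) for every entry found in the segment."""
    hits = []
    for pattern, target in entries:
        match = pattern.search(segment)
        if match:
            hits.append((match.group(0), target))
    return hits


entries = load_glossary(GLOSSARY_TEXT)
print(lookup("De batterijspanning van 48 V kost ongeveer €600.", entries))
```

Testing a pattern this way before adding it to the glossary also catches slips such as an unescaped "?" or "." that would otherwise match more (or less) than intended.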

My glossary file consists of the following 10 columns:

#nl-NL [TAB] #en-GB [TAB] #Context [TAB] #Subject [TAB] #Client [TAB] #Note [TAB] #Sense [TAB] #Usage example [TAB] #Source [TAB] #URL

Note that this is just my own personal preference. CafeTran lets you create as many (or as few) columns as you want, so you can create a full-blown termbase system, or you can just use a basic src/trgt, or src/trgt/subject/client structure. Whatever floats your boat: Igor hasn't made anything mandatory. There are a few defaults that you need to stick to (I think Context needs to come third), but basically the rest is up to you.

Hans van den Broek hates these files, but there is no need to use them if you don't want to and the system they are built on has absolutely no effect on the rest of CT. Let me stress this: all the new features I requested re: CT's tab-delimited txt glossaries, and which Igor implemented, none of these impact the system as a whole in any way. All the pipe characters and regexes, and synonyms, etc: if you don't like them, just don't use them. Hell, if you don't go looking for them, you wouldn't even know they were there. Hans just doesn't like me and likes to complain, so he tries to make it look like I ruined something, whereas I have actually contributed a large portion of the new ideas over the last year or two. The fact of the matter is CT is just getting better and better, period.

Tab-dels (as they are sometimes referred to by CT users) are just another format to store terminology in, in CT. You can also store your terms in TMXs, which CT calls "TMX termbases", or "Memories for Terms" in the older lingo. Obviously, storing terms in TMX files has all manner of limitations (the main one being it can't properly handle synonyms, or bundled entries covering a single concept), but they do have the benefit of allowing you to access the fuzzy matching functionality of the translation memories (which has limited value, if you ask me, but then I'm no expert in this area as I generally don't store any terms in TMXs, only project segments).

In my system:

#nl-NL (source term; can contain synonyms; just separate with a ";")
#en-GB (target term; can contain synonyms; just separate with a ";")
#Context ("Contextual Priority" aka "Context-aware Auto-assembling" (C-3A); see: http://cafetran.wikidot.com/using-context-aware-auto-assembling)
#Subject (can be used to give the entry priority in the auto-assembling system)
#Client (can be used to give the entry priority in the auto-assembling system)
#Note (I use this for various purposes)
#Sense (I use this to distinguish terms from one another)
#Usage example (self-explanatory)
#Source (where you found the term; I use this column to manage the glossary as a whole later in Ron's CSV Editor)
#URL (clickable hyperlink. e.g., to quickly take you to a relevant Wikipedia article, or even file on your computer, when translating)

These are just mine. You can of course come up with your own, better system!
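Because the file is plain tab-delimited UTF-8, it is also easy to process with a script. The sketch below reads a glossary with the ten columns listed above and splits the semicolon-separated synonyms; it assumes the first line of the file is the header row with the #-prefixed column names. The sample row and the function name `read_glossary` are made up for illustration.

```python
import csv
import io

COLUMNS = ["#nl-NL", "#en-GB", "#Context", "#Subject", "#Client",
           "#Note", "#Sense", "#Usage example", "#Source", "#URL"]

# One hypothetical entry; a real file would be read from disk with encoding="utf-8".
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "omvormer;inverter", "inverter;power inverter", "", "solar", "ExampleClient",
    "", "device", "", "manual", "https://en.wikipedia.org/wiki/Power_inverter",
]) + "\n"


def read_glossary(text):
    """Return one dict per entry, with source and target synonyms split into lists."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    rows = []
    for fields in reader:
        fields += [""] * (len(header) - len(fields))  # pad rows with fewer columns
        row = dict(zip(header, fields))
        for lang in header[:2]:  # first two columns: source and target terms
            row[lang] = [s.strip() for s in row[lang].split(";") if s.strip()]
        rows.append(row)
    return rows


for entry in read_glossary(SAMPLE):
    print(entry["nl-NL"], "->", entry["en-GB"], "| client:", entry["Client"])
```

The same loop can be used for simple maintenance tasks, such as listing all entries for one client or all entries with an empty Source column.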

Also note that it is not necessary to have all these fields shown in the glossary pane, you can hide any of them you want. See e.g.:

"It is now possible to hide specific metadata fields in the Glossary pane. This can be useful to keep the Glossary pane from getting too messy if you tend to enter a lot of metadata. The fields you wish to hide can now be defined by comma separated numbers. Go to: Edit > Options > Glossary > Fields to hide, and enter a number for each field you wish to hide. Imagine your Glossary has 10 fields (#nl-NL –– #en-GB –– #Context –– #Subject –– #Client –– #Note –– #Sense –– #Usage example –– #Source –– #URL), and you want to display only the ‘sense’ field (and of course the source and target term). You would therefore enter: ‘1,2,3,4,6,7,8’. Only the source, target, and the sense field will now be displayed in your Glossary pane." (http://cafetranhelp.com/changelog)

E.g., you might enter lengthy definitions, but not want them to clog up your glossary pane: just hide them. You will still be able to see them by merely hovering over the relevant term in the glossary pane.
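The numbering in that changelog example is easy to get wrong, so here is a small sketch of how the list seems to work. It assumes that only the metadata fields are numbered, starting at 1 after the source and target columns; that is inferred from the example ('1,2,3,4,6,7,8' leaving only Sense visible), not from any documentation. The function name `visible_fields` is made up.

```python
COLUMNS = ["#nl-NL", "#en-GB", "#Context", "#Subject", "#Client",
           "#Note", "#Sense", "#Usage example", "#Source", "#URL"]


def visible_fields(columns, hide_spec):
    """Return the columns left visible for a comma-separated 'fields to hide' string."""
    hidden = {int(n) for n in hide_spec.split(",") if n.strip()}
    metadata = columns[2:]  # assumption: numbering starts after source and target
    kept = [name for i, name in enumerate(metadata, start=1) if i not in hidden]
    return columns[:2] + kept


print(visible_fields(COLUMNS, "1,2,3,4,6,7,8"))
```

Under that assumption, Context is field 1 and Sense is field 5, which is why 5 is the one number missing from the example.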

CafeTran's terminology system really is the most powerful and flexible of any CAT tool, imo. And yes, in terms of features that actually matter to us translators, I would say it even beats MultiTerm and Transit, often lauded as having very powerful terminology systems.

#########################
And all this is only the TERMINOLOGY FUNCTIONALITY of CafeTran. I haven't even mentioned the "Total Recall" system, which recently got a major upgrade allowing us to use the same SQLite db format that Farkas András uses in his amazing tool TMLookup (http://www.farkastranslations.com/tmlookup.php). The most important effects of this are:

(1) the Total Recall system is now much, much faster. So fast, in fact, that I can now pre-translate my entire project against my massive collection of TMXs (including all the DGT-TMs ever released for NL/EN, all the CELEX stuff from http://www.farkastranslations.com/eu_translation_memories.php, all my own TMs from over the years, all kinds of stuff from the old Opus site (now dead?) opus.lingfil.uu.se/, + the new site @ http://datahub.io/dataset/opus, and god knows how many TMs downloaded from the TAUS Data site) ... in under a minute. I actually still have my entire collection of TMXs inside memoQ, and trying something similar would take hours and hours in memoQ (I have no idea how long exactly, because I have never actually let it finish). My massive collection of TMXs actually contains around 45 million TUs and the TMLookup SQLite db file is around 25 GB on disk! CT can search through this WHOLE thing, looking for possibly useful matches in your current project in under a minute!

(2) You can now use your default TMLookup DB file (default.db) inside CafeTran's Total Recall system without having to change or edit it at all! What's more, SQLite DBs allow concurrent lookups, so you don't even have to close TMLookup!
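The concurrent-lookup point can be tried with a few lines of Python's sqlite3 module. This is only a sketch: the table name `tu` and its two columns are a hypothetical schema invented for the demo, not the real layout of a TMLookup or Total Recall database, and the demo builds its own small file instead of touching default.db.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def build_demo_db(path):
    """Create a tiny stand-in database (hypothetical schema, not TMLookup's)."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE tu (source TEXT, target TEXT)")
    con.executemany("INSERT INTO tu VALUES (?, ?)", [
        ("max. vermogen gedurende 10 minuten", "max. power for 10 minutes"),
        ("batterijspanning van 48 V", "battery voltage of 48 V"),
    ])
    con.commit()
    con.close()


def open_readonly(path):
    """Open the file read-only, so a lookup tool cannot modify it by accident."""
    return sqlite3.connect(Path(path).resolve().as_uri() + "?mode=ro", uri=True)


def lookup(con, fragment):
    """Return all (source, target) pairs whose source contains the fragment."""
    query = "SELECT source, target FROM tu WHERE source LIKE ?"
    return con.execute(query, (f"%{fragment}%",)).fetchall()


def demo():
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.db")
        build_demo_db(path)
        first = open_readonly(path)   # stands in for one program
        second = open_readonly(path)  # stands in for another, open at the same time
        try:
            return lookup(first, "vermogen"), lookup(second, "batterij")
        finally:
            first.close()
            second.close()


print(demo())
```

Both connections read from the same file at the same time without either having to close, which is the behaviour described above for running TMLookup alongside CafeTran.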

#########################

Hope all this made some sense! I really hate writing long posts in this silly Yahoo Groups interface. It really is a piece of &^%$. I actually wrote the whole thing in a text editor and switched the Yahoo interface to plain text, but even then the stupid system will probably double all my line endings and otherwise garble my message. Can't wait for the new CafeTran forum there has been talk about recently: an actual forum where you can write in peace and present people with a clear message.

Michael
#########################################
#########################################


 
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 18:13
This one is nice too :) Jun 25, 2015

Michael Beijer wrote:

(1) regular expressions (this feature is amazingly useful; you can use regex right inside glossary entries!)


My latest addition:



Which glossary entry auto-assembles as:



 