Trim internal fuzzies (AutoIt script)
Thread poster: Samuel Murray
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
Aug 16, 2020

Hello everyone

I have written a very simple, work-in-progress, proof-of-concept set of scripts that searches a list of sentences for internal fuzzy matches and then groups internal fuzzy matches together. This would potentially allow one to share a job among multiple translators while preventing internal fuzzy matches from being split between translators (who would otherwise not have any benefit from such matches, since the "other" sentences may have been given to other translators).

These are AutoIt scripts, so you need AutoIt installed on your computer to use them. The input file format is plain text, UTF-8 with BOM, one sentence per line. The output file contains the same sentences in the same order, except that internal fuzzy matching segments from later in the list are moved up and grouped with the earliest one of the matches.
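
For readers who want to see the idea in code, here is a minimal Python sketch of the grouping step described above. It is not the AutoIt code from the ZIP file: the similarity formula (1 minus the edit distance divided by the longer length), the 75% default threshold, the file handling, and the names `levenshtein`, `similarity`, `group_internal_fuzzies` and `regroup_file` are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed formula: 1 - distance / length of the longer sentence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def group_internal_fuzzies(sentences, threshold=0.75):
    """Group each sentence with the later sentences that fuzzy-match it."""
    groups = []
    used = [False] * len(sentences)
    for i, first in enumerate(sentences):
        if used[i]:
            continue  # already moved up into an earlier group
        used[i] = True
        group = [first]
        for j in range(i + 1, len(sentences)):
            if not used[j] and similarity(first, sentences[j]) >= threshold:
                used[j] = True
                group.append(sentences[j])
        groups.append(group)
    return groups


def regroup_file(in_path, out_path, threshold=0.75):
    """Read one sentence per line (UTF-8 with BOM) and write the grouped list."""
    with open(in_path, encoding="utf-8-sig") as handle:
        sentences = [line.rstrip("\n") for line in handle if line.strip()]
    with open(out_path, "w", encoding="utf-8-sig") as handle:
        for group in group_internal_fuzzies(sentences, threshold):
            for sentence in group:
                handle.write(sentence + "\n")
```

In this sketch, the first sentence of each group is the earliest occurrence, and everything after it in the group is an internal fuzzy match that was moved up from later in the list.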

http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v01.zip

There are four scripts, which use Levenshtein and Sift2 fuzzy matching with two different methods of searching. The inner workings of both Levenshtein and Sift2 are Greek to me, so don't bother asking such technical questions. (-:

Samuel

PS. If you know of any CAT tool that can do this sort of thing (i.e. either remove/export or group internal fuzzies in a file), please please let me know. It should be a standard feature, but it isn't.

---------------------------------------------------------------------------
Added:

If a tool or CAT tool has a feature such as the one I'm talking about, one should be able to tweak its settings to reduce this list of 20 segments down to fewer than ten, or even down to a list of four:

This is the house that the Jack built for his friend the alligator.
This is the house that the Jack built for his friend the bear.
This is the house that the Jack built for his friend the camel.
This is the house that the Jack built for his friend the dolphin.
This is the house that the Jack built for his friend the elephant.
In the Old West, cowboys and their wives ate only fish.
In the Old West, cowboys and their wives ate only giraffe.
In the Old West, cowboys and their wives ate only hippo.
In the Old West, cowboys and their wives ate only insect.
In the Old West, cowboys and their wives ate only jellyfish.
The kangaroo went on a long holiday and never returned.
The lion went on a long holiday and never returned.
The monkey went on a long holiday and never returned.
The newt went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.
Everyone agrees that the cutest animals are x-ray fish, yak and zebra.


If I run one of my Levenshtein scripts on this list, I get these results:

Fuzzy threshold: 75%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.


Fuzzy threshold: 85%
This is the house that the Jack built for his friend the alligator.
In the Old West, cowboys and their wives ate only fish.
The kangaroo went on a long holiday and never returned.
The owl went on a long holiday and never returned.
Everyone agrees that the cutest animals are penguin and quail.
Everyone agrees that the cutest animals are raccoon and seal.
Everyone agrees that the cutest animals are tiger and unicorn.
Everyone agrees that the cutest animals are viper and whale.




[Edited at 2020-08-17 08:08 GMT]


 
James Plastow
United Kingdom
Member (2020)
Japanese to English
Excel Aug 16, 2020

Perhaps this fuzzy match add-in for Excel might also be able to do the job? (I have not tried it)

https://www.microsoft.com/en-us/download/details.aspx?id=15011


 
Samuel Murray
TOPIC STARTER
@James Aug 16, 2020

James Plastow wrote:
Perhaps this fuzzy match add-in for Excel might also be able to do the job? (I have not tried it)


No, the Fuzzy Lookup add-on in Excel compares two tables and tries to match data from the one table to data in the other table. It does not (and can't) match cells from within a single table (i.e. a single list of sentences) to each other.

https://www.youtube.com/watch?v=3v-qxcjZbyo


 
James Plastow
ablebits Aug 16, 2020

I see,

https://www.ablebits.com/docs/excel-find-fuzzy-duplicates/

is another one, but I guess you have already looked into the available options.


 
Samuel Murray
TOPIC STARTER
@James Aug 16, 2020

James Plastow wrote:
https://www.ablebits.com/docs/excel-find-fuzzy-duplicates/
is another one, but I guess you have already looked into the available options.


No, I haven't looked at Excel tools at all.

The Ablebits one appears to be a potential solution, but unfortunately it went straight to end-of-demo on my computer, so I was unable to test it. It doesn't appear that one can set a fuzzy match percentage; rather, one sets a number of characters that differ. So it may work, but it may also not (since it was really designed to find fuzzy matches in short field data, e.g. people's names or addresses). It is a bit expensive, though: $99. Here's a video of it:
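
The difference between a "number of differing characters" setting and a percentage threshold can be made concrete with a little arithmetic. The Python sketch below assumes the common definition similarity = 1 - (differing characters / length of the longer string). Whether Ablebits or any CAT tool uses exactly this formula is an assumption, and the function names are made up for the example.

```python
def similarity_pct(length: int, differing_chars: int) -> float:
    """Match percentage for a string of `length` characters with a fixed
    number of differing characters (assumed formula, see lead-in)."""
    return 100.0 * (length - differing_chars) / length


def max_differing_chars(length: int, threshold_pct: int) -> int:
    """Largest number of differing characters that still meets the
    percentage threshold for a string of `length` characters."""
    return length * (100 - threshold_pct) // 100
```

Under this formula, 3 differing characters in a 10-character name is a 70% match, while the same 3 characters in a 60-character sentence is a 95% match. A fixed character count therefore behaves very differently on short field data than on full sentences.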
https://www.youtube.com/watch?v=2Tc5Ifl2bX4


 
James Plastow
google sheets Aug 16, 2020

OK, if you haven't looked into it, it may also be worth searching for Google Sheets fuzzy match add-ins.
Seems like there are quite a few options, for example Flookup.


 
Samuel Murray
TOPIC STARTER
Okay, we should investigate various Google Sheets and Excel add-ons Aug 16, 2020

James Plastow wrote:
It may also be worth searching for Google Sheets fuzzy match add-ins. For example Flookup.


Okay.

FWIW, Flookup itself doesn't help with this problem: it can only remove (or highlight) all fuzzy matches (and not just all except 1), and in my tests it flagged widely divergent sentences of divergent lengths as being fuzzy matches. And the highest match threshold is 90% (it can also do 80%, 70%, 60% etc, but not e.g. 93%, 95% etc.). Price: $10 per month.


 
James Plastow
workaround Aug 16, 2020

I have been playing around with it,
one way is to say
> Approximate fuzzy matches as segments with duplicate substrings

So, use TextSTAT (free) to find the most frequent substrings in the text. (paste the source into Notepad then open in TextSTAT)
* Actually this works nicely for Japanese but only matches individual words with English. There should be some software that will analyze phrase frequency in Western languages.

Export the list of frequent substrings and open it in Excel

Tidy up the list if necessary

Use the TEXTJOIN function with | as the delimiter to create a single search term from all the frequent substrings

Paste this into the Trados filter box


This is no good for specific percentages of fuzzy matches, but it does let you filter for segments with repeated phrases.
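
The last two steps (joining the substrings with | and filtering on the result) can be sketched in Python as follows. This is only an illustration of the idea: it uses Python's `re` module rather than Excel's TEXTJOIN or the Trados filter box, whose regular expression dialect may differ, and the function names are hypothetical.

```python
import re


def build_search_term(substrings):
    """Join substrings into one regex alternation, like joining with "|"."""
    cleaned = [s.strip() for s in substrings if s.strip()]
    # Longest first, so a long phrase is tried before a shorter one it contains.
    cleaned.sort(key=len, reverse=True)
    # Escape characters such as "." or "(" that have a special regex meaning.
    return "|".join(re.escape(s) for s in cleaned)


def filter_segments(segments, substrings):
    """Return the segments that contain at least one of the substrings."""
    term = build_search_term(substrings)
    if not term:
        return []  # an empty pattern would match every segment
    pattern = re.compile(term)
    return [seg for seg in segments if pattern.search(seg)]
```

Note that `re.escape` produces Python-style escaping, so the resulting search term may need adjusting before it is pasted into another tool.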

[Edited at 2020-08-16 19:55 GMT]

[Edited at 2020-08-16 19:58 GMT]


 
Samuel Murray
TOPIC STARTER
@James Aug 16, 2020

James Plastow wrote:
Find the most frequent sub-strings in the text [and filter segments by it in Trados].
It does let you filter for segments with repeated phrases.


That may be so, but that is not what I'm trying to achieve. Take a look at these two sentences:
- The rain in Spain falls mainly on the plains in October and November each year.
- In the Old West, cowboys sat on their horses mainly on the plains where the grasses grew.
Both contain the sub-string "mainly on the plains" but they are by no means fuzzy matches of each other.

I'm not convinced that one can say that segments are fuzzy matches of each other if they share frequent sub-strings... but even if one could, non-fuzzy matches may also contain those sub-strings, and we don't want to flag non-fuzzy matching segments.
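
A quick way to check this is to measure both things for the two example sentences. The Python sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. It is not Levenshtein and not necessarily what any CAT tool uses, so treat the number it prints as indicative only.

```python
from difflib import SequenceMatcher

a = "The rain in Spain falls mainly on the plains in October and November each year."
b = "In the Old West, cowboys sat on their horses mainly on the plains where the grasses grew."
shared = "mainly on the plains"


def ratio(x: str, y: str) -> float:
    """difflib's similarity ratio: 2 * matching characters / total characters."""
    return SequenceMatcher(None, x, y).ratio()


shares_substring = shared in a and shared in b
score = ratio(a, b)
print(f"share the substring: {shares_substring}, similarity: {score:.0%}")
```

Both sentences contain the shared phrase, yet the similarity score stays below a 75% threshold, so a substring filter would flag a pair that a fuzzy matcher would not.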


[Edited at 2020-08-16 21:18 GMT]


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Transit NXT Aug 17, 2020

https://www.star-spain.com/en/blog/transittermstar-nxt-tooltips/creating-translation-extracts-and-reference-extracts

 
Samuel Murray
TOPIC STARTER
@Hans Aug 17, 2020



Thanks. I don't have NXT but from what I can tell from that blog post, these two features are not what we're looking for.

According to the blog post (and from the screenshots), in NXT one can create two types of reduced sets of data, namely a "translation extract" (which extracts all untranslated segments, for re-import later) and a "reference extract" (which, I'm guessing, extracts TUs from TMs and possibly also glossaries).

The reference extract does have the option of specifying a fuzzy threshold, so if NXT retains multiple instances of TUs in its TMs, then perhaps this feature can be used after all: create a source=target TM from the source file, run a "reference extract" with the source file against that TM only, export the result to a format that one can process (e.g. TMX), remove any TUs that occur only once (TUs that occur more than once would be TUs that matched more than just their own segment), convert the result to a new plaintext source file with one segment per line, and remove duplicate lines. This all hinges on the assumption that NXT writes (retains) multiple instances of identical translations into its own TM... or that its TM system contains a segment re-use counter.
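
The last steps of this hypothetical workaround (dropping TUs that occur only once, then removing duplicate lines) are simple enough to sketch in Python. The sketch assumes the exported TM has already been flattened to a list of source segments with one entry per stored TU, which depends on the same unverified assumption about how NXT stores TUs. The function name is made up.

```python
from collections import Counter


def segments_matched_more_than_once(exported_segments):
    """Keep segments that occur more than once, without duplicates,
    in the order in which they first appear."""
    counts = Counter(exported_segments)
    seen = set()
    result = []
    for segment in exported_segments:
        if counts[segment] > 1 and segment not in seen:
            seen.add(segment)
            result.append(segment)
    return result
```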

See also my updated first post with a test file.

[Edited at 2020-08-17 08:08 GMT]


 
Samuel Murray
TOPIC STARTER
New, updated version Aug 18, 2020

http://www.leuce.com/autoit/trim%20internal%20fuzzies%20v03a.zip

New improved version, with non-working scripts removed, and with more settings to speed things up (by reducing the number of matches found). An input list of anything over 1000 sentences runs the risk of running for an hour or more. The "type 1" script is the fastest for long texts. Also exports a second file with only the fuzzy segments. Real-world example included in the ZIP file.
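
Long lists are slow partly because the number of possible sentence pairs grows quickly: 1000 sentences give 1000 × 999 / 2 = 499,500 pairs. The settings in the scripts are not described in this post, so the following Python sketch shows only one generic way to reduce the number of comparisons, not necessarily what the AutoIt scripts do. It assumes similarity is defined as 1 - (edit distance / longer length) and relies on the fact that the edit distance can never be smaller than the difference in length. The function names are hypothetical.

```python
from itertools import combinations


def lengths_allow_match(a: str, b: str, threshold: float) -> bool:
    """Cheap pre-check: can these two sentences reach the threshold at all?"""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    # Best case: the sentences differ only by the extra characters.
    best_possible = 1.0 - abs(len(a) - len(b)) / longest
    return best_possible >= threshold


def candidate_pairs(sentences, threshold):
    """Yield index pairs that are still worth a full fuzzy comparison."""
    for (i, a), (j, b) in combinations(enumerate(sentences), 2):
        if lengths_allow_match(a, b, threshold):
            yield i, j
```

Only the pairs that pass this check would then need the expensive fuzzy comparison. The higher the threshold, the more pairs are skipped.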





[Edited at 2020-08-18 20:16 GMT]


 

