The Free Dictionary  

MASAKARI: The people's choice 'General Purpose Grade' English wordlist
leonAzul
Posted: Friday, June 23, 2017 10:27:37 PM

Rank: Advanced Member

Joined: 8/11/2011
Posts: 8,349
Neurons: 26,527
Location: Miami, Florida, United States
Sanmayce wrote:


It is a shame for such a monstrous corpus to lack such a must-have as bansheeesque.

To me, all those nonhyphened (non-hyphened) variants are cooler than hyphened ones.


When you get to your second or third eyeglass prescription you might change your mind about the hyphen.
Dancing


"Make it go away, Mrs Whatsit," he whispered. "Make it go away. It's evil."
leonAzul
Posted: Saturday, June 24, 2017 12:33:45 AM

Sanmayce wrote:

In the 'Prince Igor' lyrics, at the end, one interesting to me question arises:
How "But I'm down for ma biatch" differs, if at all, from "down with"?


It's the difference between "for" and "with".

To be down for means to be open to. To be down with means to be in agreement. Yet there's another sense in which "down for" can mean "supportive of" in a way that is much more intense than merely "down with".

leonAzul
Posted: Saturday, June 24, 2017 12:55:19 AM

Sanmayce wrote:

Looking up the idiom "pull under":

pull someone or something under
1. Lit. to drag someone or something beneath the surface of something. The strong undertow pulled John under the surface. The whirlpool nearly pulled the boat under.
2. Fig. to cause someone or something to fail. The heavy debt load pulled Don under. He went out of business. The recession pulled his candy shop under.


My opinion is that a metaphorical use of the first definition will get you there quicker. The sense is very much like what Jim Morrison wrote: "No one gets out of here alive," so we might as well accept that fact and live as much as we can.


Sanmayce wrote:

Thus, combining Moore's and Hamlet's lyrics:

"Lyricist Kevin Moore refers to Shakespeare's Hamlet, as told from Prince Hamlet's point of view.[2] The lyrics allude heavily to the play, echoing Hamlet's desire to give in to his urge to gain revenge for his father at the cost of his own sanity. Over the final moment of the song, James LaBrie can be heard singing the song's only direct quote from the play: "O, that this too, too solid flesh would melt". Therein, Prince Hamlet is pleading for escape from his mortal trappings."

O, that this too, too solid flesh would melt,
Thaw, and resolve itself into a dew!
Or that the Everlasting had not fix'd
His canon 'gainst self-slaughter! O God! God!
How weary, stale, flat, and unprofitable
Seem to me all the uses of this world!

—Prince Hamlet in Hamlet, Act I Scene II

Dream Theater and Shakespeare form a strong duo here; the core of my assumption is "resolve itself into a dew", which to me equals full immersion into the world, i.e. accepting/living the world as it is.


To my ear, "resolve itself into a dew" is a pretty way to say vaporize. Hamlet seeks to have his body liquified, sprayed on the ground, and evaporated in the morning sun.

And speaking of Hamlet, "pull me under" sounds a great deal like something Ophelia would say.

Sanmayce
Posted: Saturday, June 24, 2017 10:52:40 AM

Rank: Advanced Member

Joined: 5/29/2012
Posts: 331
Neurons: 15,702
Location: Sofia, Sofia-Capital, Bulgaria
Thank you shahidmost, really appreciate the appreciation.

There is something mystical about logosphilia, I mean mystique indeed. To me, this is a realm of its own behind too many activities, unaware we are of it most of the time.

Leon gave the formal, widespread term, the one mostly accepted in the books; however, despite my forevermore buggy English, I see it a little differently.

Our Greek neighbors are the source (Sanskrit is not to be forgotten though); their language influenced our language heavily. As for English, my take is this:

I don't know why the 's' is trimmed. 'Logo', as we all know, is an emblem, so we overlap with 'lover of emblems', which creates problems. To avoid them, my suggestions are:

logos+phile derives from 'λόγος'
frasi+phile derives from 'φράση'
lexis+phile derives from 'λέξις'

The corethread is that restricting lovingness to words invites the coming of a broader term - lover of phrases too. I consider myself a phrase hunter/gatherer/amasser/appreciator and naturally a word explorer. Some phrases (especially when sung) spellbind my mind, love it, the magic usually lasts for several days.

I would love to hear, from some fellow versed in Greek, what he/she thinks.

Leon, shahidmost, I salute you with Dream Theater - Pull Me Under (Cover), really got my attention since yesterday.

Add-on:
Leon, just saw your helpful/useful clarifications, thanks for 'down+*' explanations.
Thanks also for correcting 'mine' to 'ma'; I thought there was something fishy but gave the lyrics dumper the benefit of the doubt.

>To my ear, "resolve itself into a dew" is a pretty way to say vaporize. Hamlet seeks to have his body liquified, sprayed on the ground, and evaporated in the morning sun.
Word up, I stand corrected; in fact, logic dictates what you say: he wanted to do it but didn't like it, hence his wish to leave the "dramadom" (love this word already). Yet, Leon, the lyricist enriched the scene by adding a wonderful twist, namely immersion in and acceptance of the "play". I so much love the 'Pull me under I'm not afraid'; it hits the heart.

>And speaking of Hamlet, "pull me under" sounds a great deal like something Ophelia would say.
Hee-hee, guess if that new sense is to be added it has to be rated kinda 16+.

He learns not to learn and reverts to what all men pass by.
leonAzul
Posted: Monday, June 26, 2017 12:27:04 AM

Sanmayce wrote:

>And speaking of Hamlet, "pull me under" sounds a great deal like something Ophelia would say.
Hee-hee, guess if that new sense is to be added it has to be rated kinda 16+.


Hadn't thought it out that way, I was referring to her drowning, yet there is also this famous exchange:

HAMLET
Lady, shall I lie in your lap?
Lying down at OPHELIA's feet

OPHELIA
No, my lord.

HAMLET
I mean, my head upon your lap?

OPHELIA
Ay, my lord.

HAMLET
Do you think I meant country matters?

OPHELIA
I think nothing, my lord.

HAMLET
That's a fair thought to lie between maids' legs.

OPHELIA
What is, my lord?

HAMLET
Nothing.

Whistle

leonAzul
Posted: Monday, June 26, 2017 2:55:22 AM

Sanmayce wrote:


There is something mystical about logosphilia, I mean mystique indeed. To me, this is a realm of its own behind too many activities, unaware we are of it most of the time.

Leon gave the formal, widespread term, the one mostly accepted in the books; however, despite my forevermore buggy English, I see it a little differently.

Our Greek neighbors are the source (Sanskrit is not to be forgotten though); their language influenced our language heavily. As for English, my take is this:

I don't know why the 's' is trimmed. 'Logo', as we all know, is an emblem, so we overlap with 'lover of emblems', which creates problems. To avoid them, my suggestions are:

logos+phile derives from 'λόγος'
frasi+phile derives from 'φράση'
lexis+phile derives from 'λέξις'

The corethread is that restricting lovingness to words invites the coming of a broader term - lover of phrases too. I consider myself a phrase hunter/gatherer/amasser/appreciator and naturally a word explorer. Some phrases (especially when sung) spellbind my mind, love it, the magic usually lasts for several days.

I would love to hear, from some fellow versed in Greek, what he/she thinks.



I don't claim to be well-versed in Greek, but I do know something about how English works. In spoken form it prefers alternating simple consonants with vowels of varying complexity. Ceteris paribus, "logo-" is the preferred combining form with "-phile" because it is easier to say for most native speakers of English.

The word "logo" is attested from around 1937 as an abbreviated version of logogram.

The combining form "logos-" would not make sense for most native speakers unless in a very limited way. Since the King James translation of the New Testament, the word "Logos", as used in the Gospel according to John, has acquired a very specific meaning within a specific context. After all these years, that book continues to have an influence on the English language.

Sanmayce
Posted: Tuesday, July 4, 2017 1:30:07 PM

A hardware meltdown in my laptop sent it to the machine limbo and left me without Internet for 7+ days; only just now did I manage to get another one operational.

Thanks Leon for the prompt hint.

All logo derivatives found on the superb 'Online Etymology Dictionary' site.

>Sometime ago, I asked a question here in this very forum: what is the word for a lover of words?

My first choices/choosements would be:

- wordsmith (after locksmith, silversmith (an artist dealing with silver; a person who makes articles out of silver), swordsmith);
- logolater;
- logomaniac ("one mad for words," 1870; see logo- "word" + -mania);
- logomancer (after necromancer: a person who practices necromancy);
- logomach (plural logomachs: someone who argues about the meaning of words);


The derivements are after:

logolatry (n.)
"worship of words," 1810 (Coleridge), from logo- + -latry.


Since an idolater is a practitioner of idolatry, then I would call myself a logolater.

https://en.wiktionary.org/wiki/necromancy

Diving somewhat more:

It makes me rethink my choice-on-prima-vista, thus WORDCRAFT AFICIONADO; yes, not a monolithic wording, but it conveys fully what I mean.
You see, 'witchcraft', 'warcraft' bring the "precedent/blueprint", to push it farther, WORDMANSHIP FANATIC as in 'sword+craft' versus 'sword+manship'.
'Craft+y' being a sister word to 'master+y'.

The simplification leads to:
WORD FANATIC
or glued to:
WORDFAN

As for 'phrase lover/fan' dressed in Greek, is 'frasi+phile' plausible, or is an 'o' infix somehow needed as glue, that is, FRASOPHILE?!
Or, we have to follow the pattern as in etymology/phraseology leading to etymologist/phraseologist.
phraseologist - A collector or coiner of phrases.
So, this one is really on the point.

https://en.wiktionary.org/wiki/phraseogram

>The word "logo" is attested to around 1937 as an abbreviated version of logogram.
How do you accept 'logogram+mer', or 'phraseogram+mer'? How come we don't have words for the manager of these two! My drift is 'drum+mer'.

This is a fun fact:
http://www.etymonline.com/index.php?term=phraseology&allowed_in_frame=0



leonAzul
Posted: Thursday, July 6, 2017 10:17:50 AM

Sanmayce wrote:

- logomach (plural logomachs: someone who argues about the meaning of words);


Something about this neologism really tickles my ear. 8^)

Please allow me, however, to clarify the definition. A logomach would be more properly someone who fights over, or with, words.

A person who argues reasonably about words would be

— wait for it—

a logologist.

Dancing

leonAzul
Posted: Thursday, July 6, 2017 12:51:41 PM

Sanmayce wrote:

How do you accept 'logogram+mer', or 'phraseogram+mer'? How come we don't have words for the manager of these two! My drift is 'drum+mer'.


"Logogrammer" seems fair dinkum to me.

I would question the utility of "phraseogram" when "logogram" already exists, and I find a "distinction without a difference" between them. I suppose if one needed that word to complete a limerick then I would give it a bye.

Whistle

Sanmayce
Posted: Friday, July 7, 2017 4:26:09 PM

>"Logogrammer" seems fair dinkum to me.

Same here, a sister word to program+mer, genesiswise. The very shape and curves of "Logogrammer" beg the question: why the heck has it not been coined for so long?
After googling it, it shows sporadic appearances in some .NO and .DK domains, that is, the smart wordsmiths in Norway and Denmark made good.

A little snack for thought:
phonogram (plural phonograms)
1. (linguistics) A character or symbol (grapheme) that represents a sound, as opposed to logograms and determinatives.
2. (law) An audio recording, regardless of physical format.


phonogram+mer would be:

1. A linguist working in a transcription department (a loose coinage).
2. A person/device recording audio, on vinyl, paper, laserdisc ...

A related one would be audiologger.

>I would question the utility of "phraseogram" when "logogram" already exists, and I find a "distinction without a difference" between them.

Yes, "logogram" is kinda inclusive, yet, the terms 'n-gram' or 'n-arc' used by Google overlap with "frasogram" encompassing monogram, bigram/arc, trigram/biarc, ...
As in the phrase:
Black Cumin Seed

the 2-grams being:
Black_Cumin
Cumin_Seed


the 3-gram being:
Black_Cumin_Seed

the arcs being:
Black_Cumin
Black_Seed
Cumin_Seed

--------------
| ----- ---- |
| |   | |  | |
Black_Cumin_Seed

the biarc being:
  ----- ----
  |   | |  |
Black_Cumin_Seed

For more on arcs: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
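
The listing above can be reproduced with a few lines of Python. This is only a sketch: the function names `ngrams` and `word_pairs` are made up for illustration, and "arc" is taken here in the loose sense used in this post (any pair of words from the phrase, adjacent or not). Google's syntactic n-grams derive arcs from a dependency parse, which this sketch does not attempt.

```python
from itertools import combinations


def ngrams(words, n):
    """Return the contiguous n-word sequences, joined with underscores."""
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def word_pairs(words):
    """Return every pair of words in original order, adjacent or not.

    Assumption: this stands in for the 'arcs' listed in the post; it is
    not a dependency parse.
    """
    return ["_".join(pair) for pair in combinations(words, 2)]


words = "Black Cumin Seed".split()
print(ngrams(words, 2))   # ['Black_Cumin', 'Cumin_Seed']
print(ngrams(words, 3))   # ['Black_Cumin_Seed']
print(word_pairs(words))  # ['Black_Cumin', 'Black_Seed', 'Cumin_Seed']
```

The only difference between the 2-grams and the pairs is the non-adjacent 'Black_Seed', which is exactly the outer arc in the diagram.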

The good news:



After a week or so (one-third is done), I am gonna upload the bixgram corpus of English Wikipedia from 01-Jan-2017 to one of my Internet drives, to be freely downloadable by all. Then a functionality not provided by the Wikimedia team will be available, for when you want to know the whole 'nigella' family, such as:
nigella sativa
nigella damascena 'Love in a mist'
nigella damascena 'Alba'
nigella ciliaris
nigella orientalis ‘Transformer’
...


The bixgrams (x-grams of order 2) are ~ 512 x 610,000 = 312,320,000, right on! When you want to know what words precede or follow your word-of-interest, the 'Nigella' bixgram corpus derived from English Wikipedia will assist...
After finishing ripping I will show here examples of how to query the corpus, and remember I am the man who found 4 wrong appearances of 'Sylvester Stallone' outwith the discussion sections, which is unacceptable:
The exhaustive fuzzy search (the most powerful search mode known to me) spotted the following typos (outside the redirect tag):
Sylvester Stalone
Sylvestor Stallone
Slvester Stallone
Silvester Stallone
Obviously even Wikipedia is not proofed fully, guess an ocean of people would misspell the name of the beloved actor as well.
When using Nigella you will be sidekicked/strengthened by all the 2-grams coming with their ranking, so you will quickly be able to spot the bugs.
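
To give a feel for how a fuzzy search can catch such typos, here is a small sketch based on the classic Levenshtein edit distance. It is an assumption that this is close in spirit to the search mode mentioned above; it is not the code of the actual tool, and the names `edit_distance` and `near_misses` are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def near_misses(target, candidates, max_distance=1):
    """Keep candidates that are close to the target but not identical."""
    return [c for c in candidates
            if 0 < edit_distance(target, c) <= max_distance]


seen = ["Sylvester Stallone", "Sylvester Stalone", "Sylvestor Stallone",
        "Slvester Stallone", "Silvester Stallone", "Sylvester McCoy"]
print(near_misses("Sylvester Stallone", seen))
```

All four misspellings listed above sit at distance 1 from the correct name, so a threshold of 1 is already enough to flag them while leaving the correct spelling and unrelated names alone.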

While drinking a first-class beer a few days ago I noticed the motto of Starobrno brand - 'Arcanum boni tenoris animae' - The secret behind a good mood (motto of the Starobrno Brewery in Brno), and came to the idea of creating dedicated 2-gram corpus out of English Wikipedia from 01-Jan-2017 dump. My motto will mimic it:

Latin: Arcanum boni graphii animae
English: The secret of a good graphium

graphium: pen, pencil, writing style

https://en.wiktionary.org/wiki/graphii#Latin
https://en.wiktionary.org/wiki/tenoris
https://en.wiktionary.org/wiki/animae

Literally, "The secret of a good writing style soul."
Or, more loosely, "The secret of soulful writing style.", no?

The logo/emblem of the corpus is the seeds of the Black Cumin/Seed a.k.a. Nigella Sativa. Its/Hers antihelminthic/helminticide properties carry the spirit of killing the bugs (a jargon for errors) while correcting 2-grams, i.e. killing the buggy bigrams.

Nigella/Kalonji Corpus

An attempt to cover most feminine niger variants...

kalonji (uncountable): The seeds of the plant Nigella sativa used as a spice.
https://en.wiktionary.org/wiki/kalonji#English

The genus name Nigella is a diminutive of the Latin niger (black), referring to the seeds.
https://en.wikipedia.org/wiki/Nigella_sativa

Nigel: English form of Latin Nigellus, from nigellus, diminutive of niger (“black”), used in the Middle Ages to Latinize Norman Néel or Gaelic Neil.
https://en.wiktionary.org/wiki/Nigel#English

nigellus: somewhat black
[English Descendants: Nigel, nigella, niello]
https://en.wiktionary.org/wiki/nigellus#Latin

nigress (plural nigresses): Alternative spelling of niggeress
https://en.wiktionary.org/wiki/nigress

niggeress (plural niggeresses, masculine nigger): (dated, offensive) A black woman; a negress.
https://en.wiktionary.org/wiki/niggeress#English

nigritia: blackness
nigritias: blacknesses

nigritia f (genitive nigritiae); first declension: blackness, black color

Case        Singular   Plural
nominative  nigritia   nigritiae
genitive    nigritiae  nigritiārum
dative      nigritiae  nigritiīs
accusative  nigritiam  nigritiās
ablative    nigritiā   nigritiīs
vocative    nigritia   nigritiae
https://en.wiktionary.org/wiki/nigritia#Latin

More:

niger (feminine nigra, neuter nigrum); first/second declension: 1. shining black (as opposed to ater, dull black): Nigrum in candida vertere. To turn black into white. 2. bad; evil; ill-omened
Synonyms (black): fuscus
Antonyms (shining white): candidus
Number         Singular                        Plural
Case / Gender  Masculine  Feminine  Neuter     Masculine  Feminine  Neuter
nominative     niger      nigra     nigrum     nigrī      nigrae    nigra
genitive       nigrī      nigrae    nigrī      nigrōrum   nigrārum  nigrōrum
dative         nigrō      nigrae    nigrō      nigrīs     nigrīs    nigrīs
accusative     nigrum     nigram    nigrum     nigrōs     nigrās    nigra
ablative       nigrō      nigrā     nigrō      nigrīs     nigrīs    nigrīs
vocative       niger      nigra     nigrum     nigrī      nigrae    nigra
https://en.wiktionary.org/wiki/niger

A double diminutive: negro -> negrito -> negritito
A double diminutive: negra -> negrita -> negritita
https://en.wikipedia.org/wiki/Diminutive#Romance_languages
https://en.wikipedia.org/wiki/List_of_diminutives_by_language

fuscus (feminine fusca, neuter fuscum); first/second declension: 1. dark, black 2. (of the voice) husky, hoarse
Number         Singular                        Plural
Case / Gender  Masculine  Feminine  Neuter     Masculine  Feminine  Neuter
nominative     fuscus     fusca     fuscum     fuscī      fuscae    fusca
genitive       fuscī      fuscae    fuscī      fuscōrum   fuscārum  fuscōrum
dative         fuscō      fuscae    fuscō      fuscīs     fuscīs    fuscīs
accusative     fuscum     fuscam    fuscum     fuscōs     fuscās    fusca
ablative       fuscō      fuscā     fuscō      fuscīs     fuscīs    fuscīs
vocative       fusce      fusca     fuscum     fuscī      fuscae    fusca
https://en.wiktionary.org/wiki/fuscus#Latin

cigar -> cigar+ette/cigar+et
smurf -> smurf+ette
blackamoor (blackamore, blackemore, blackemoor) -> blackamoor+ette/blackamoor+et

Always wondered why diminutiveness was not used in naming 'The little matchbox girl seller'?! 'Girl' along with 'little' could be fused into 'seller+ette/seller+et', thus, 'The matchbox sellerette'. Love it.

The ripper program used below is included in the Schizandrafield corpus package, a few posts above. The idea is for everyone to have the independence and opportunity to rip ... whatever.



Needless to say, I am currently obsessed with the 'black' derivatives; after all, the corpus' name sounds to me like 'the little black one', love it.

Sanmayce
Posted: Wednesday, July 12, 2017 12:47:05 PM

Excuse me for the long post, but it contains all the stuff that this thread is all about: giving some control over words'n'phrases with the help of computers.

Finally, the first phrase-checker known to me is freely available.
Just point to your file, press a button, and your phrase-checked file is autoloaded into NOTEPAD; simpler than that I cannot fathom.

Time for one fully-functional English language e-assistant.

Wanted to put the scattered pieces-of-usefulness under one roof, in form of one package - all-in-one type:
All the files in the package, here.
Here comes _GW.7z 6,426,784,324 bytes (uncompressed: 34,089,774,935 bytes) long, featuring:

1] 32bit GUI Gallowwalker revision 3+ shell, able to phrase-check against 3-grams and 5-grams;
2] 3-grams provided are with 4 occurrences or more within Gamera Corpus r.19: 112,878,788;
3] 5-grams provided are with 4 occurrences or more within Gamera Corpus r.19: 141,736,497;
4] creating PAGODA files, just with entering a word and hitting the wide button below it;
5] Nigella Corpus included, the 2-gram corpus (313,274,731 distinct bixgrams strong) derived from English Wikipedia 01-Jan-2017;
6] 1-gram corpus 'Schizandrafield_Corpus_(64869182_unique-1-grams).wrd';
7] The ten 1-gram corpora used in merging into 6] are given in the 'Corpora_wordlists_only_to_build_Schizandrafield_Corpus' folder;
8] C sources, of all console tools used, are given in the 'Sources_PDF+DOC' folder.

Of course, full-text queries in all three (exact, wildcard, fuzzy) search modes are one-button away.
For more info on 4], see here.

The superbness of Gallowwalker GUI shell is in its ... shellishness (after 'hellishness'), that is, serving as an invoking platform while not doing anything except executing console executables - all the work is done outwith (outside it).
In fact, a console tool is assigned to each button; all console executables/tools used are written in C and fully portable to *nix OSes, so by design only the GUI has to be rewritten in the future to conform to the specific API. To me, the 'classic theme' within Windows XP/7 is quite good while the new themes betray the old artistic successes.

The next three screenshots show how 3-gramming and 5-gramming are one button click away: some 20 seconds in total to phrase-check an excerpt from an article about the most crucial tank battle, which happened on this day back in 1943.







Below, the Console/(Command Prompt) log is given for the whole ripping, some statistics:
- Total bixgrams/(2-x-grams) seen in 'enwiki-20170101-pages-articles.xml' are 6,304,429,578;
- Total distinct bixgrams/(2-x-grams): 313,274,731;
- Total memory needed for one pass: 26,890,698KB;
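
As a quick sanity check, the figures above can be cross-checked against the totals printed at the end of the log (463,774 seconds over 512 passes). The snippet below is only arithmetic on those reported numbers; the variable names are mine.

```python
total_phrases = 6_304_429_578    # all bixgrams seen, from the log
distinct_phrases = 313_274_731   # distinct bixgrams, from the log
total_seconds = 463_774          # "Total time" line in the log
passes = 512                     # "512 passes are to be made"

# Overall throughput in whole phrases per second.
print(total_phrases // total_seconds)       # 13593

# Average distinct phrases kept per pass (about 611,865).
print(round(distinct_phrases / passes))

# Average duration of one pass in seconds (about 906).
print(round(total_seconds / passes))

# Total wall-clock time in days (about 5.37).
print(round(total_seconds / 86_400, 2))
```

The first value matches the "Total performance: 13,593P/s" line, and the per-pass averages fall between the first and last passes shown in the log (611,580 and 613,044 distinct phrases; roughly 894 and 919 seconds).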

So, instead of waiting 463,774 seconds for the 512 passes to complete, you may rip in one pass if you have 27GB of fast virtual RAM (SSD-based) or, even better, 32GB of nonvirtual.
The log:

D:\rip>dir

01/07/2017 04:22 AM 60,182,193,037 enwiki-20170101-pages-articles.xml
06/11/2016 01:41 AM 133,632 Leprechaun_x-leton_32bit_Intel_01_512p.exe
06/11/2016 01:41 AM 137,216 Leprechaun_x-leton_32bit_Intel_02_512p.exe
06/11/2016 01:41 AM 137,728 Leprechaun_x-leton_32bit_Intel_03_512p.exe
06/11/2016 01:41 AM 136,192 Leprechaun_x-leton_32bit_Intel_04_512p.exe
06/11/2016 01:41 AM 137,728 Leprechaun_x-leton_32bit_Intel_05_512p.exe

D:\rip>dir enwiki-20170101-pages-articles.xml/b >enwiki-20170101-pages-articles.xml.lst

D:\rip>Leprechaun_x-leton_32bit_Intel_02_512p.exe enwiki-20170101-pages-articles.xml.lst enwiki-20170101-pages-articles.xml.02 1300123 Y
Leprechaun_doubleton (Fast-In-Future Greedy n-gram-Ripper), rev. 16FIXFIXfixfix, written by Svalqyatchx.
Purpose: Rips all distinct 2-grams (2-word phrases) with length 5..41 chars from incoming texts.
Feature1: All words within x-lets/n-grams are in range 1..31 chars inclusive.
Feature2: In this revision 512MB 1-way hash is used which results in 67,108,864 external B-Trees of order 3.
Feature3: In this revision, 512 passes are to be made.
Feature4: If the external memory has latency 99+microseconds then !(look no further), IOPS(seek-time) rules.
Pass #1 of 512:
Size of input file with files for Leprechauning: 36
Allocating HASH memory 536,870,977 bytes ... OK
Allocating memory 1270MB ... OK
Size of Input TEXTual file: 60,182,193,037
-; 07,059,831P/s; Phrase count: 6,304,429,578 of them 611,580 distinct; Done: 64/64
Bytes per second performance: 67,393,273B/s
Phrases per second performance: 7,059,831P/s
Time for putting phrases into trees: 893 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 01,223,160P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.
...
Pass #512 of 512:
Size of input file with files for Leprechauning: 36
Allocating HASH memory 536,870,977 bytes ... OK
Allocating memory 1270MB ... OK
Size of Input TEXTual file: 60,182,193,037
-; 06,867,570P/s; Phrase count: 6,304,429,578 of them 613,044 distinct; Done: 64/64
Bytes per second performance: 65,557,944B/s
Phrases per second performance: 6,867,570P/s
Time for putting phrases into trees: 918 second(s)
Flushing UNsorted phrases: 100%; Shaking trees performance: 01,226,088P/s
Time for shaking phrases from trees: 1 second(s)
Leprechaun: Current pass done.

Total memory needed for one pass: 26,890,698KB
Total distinct phrases: 313,274,731
Total time: 463774 second(s)
Total performance: 13,593P/s i.e. phrases per second
Leprechaun: Done.

D:\rip>sort /+10 /M 1012012 enwiki-20170101-pages-articles.xml.02 /O enwiki-20170101-pages-articles.xml.02.txt

D:\rip>dir

01/07/2017 04:22 AM 60,182,193,037 enwiki-20170101-pages-articles.xml
07/11/2017 06:23 AM 7,886,600,664 enwiki-20170101-pages-articles.xml.02
07/11/2017 05:22 PM 7,886,600,664 enwiki-20170101-pages-articles.xml.02.txt
07/05/2017 09:33 PM 36 enwiki-20170101-pages-articles.xml.lst
06/11/2016 01:41 AM 133,632 Leprechaun_x-leton_32bit_Intel_01_512p.exe
06/11/2016 01:41 AM 137,216 Leprechaun_x-leton_32bit_Intel_02_512p.exe
06/11/2016 01:41 AM 137,728 Leprechaun_x-leton_32bit_Intel_03_512p.exe
06/11/2016 01:41 AM 136,192 Leprechaun_x-leton_32bit_Intel_04_512p.exe
06/11/2016 01:41 AM 137,728 Leprechaun_x-leton_32bit_Intel_05_512p.exe

D:\rip>
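
Reading the log, every pass reports the same total phrase count but only about 1/512 of the distinct phrases, which suggests that each pass rescans the whole input and keeps only its own slice of the phrase space. The sketch below illustrates that general idea (hash partitioning over several passes) in Python. It is a guess at the principle, not a port of the Leprechaun source: the function names, the use of CRC32, and the in-memory `totals` dictionary (standing in for the output file) are all assumptions.

```python
import zlib
from collections import Counter


def bigrams(text):
    """Yield lowercase 2-word phrases joined with an underscore."""
    words = text.lower().split()
    return ("_".join(pair) for pair in zip(words, words[1:]))


def rip_in_passes(text, passes):
    """Count distinct 2-grams over several passes.

    Each pass rescans the whole input but keeps only the phrases whose
    hash falls into the current bucket, so the working set of one pass
    is roughly 1/passes of all distinct phrases.
    """
    totals = {}
    for current in range(passes):
        bucket = Counter()
        for phrase in bigrams(text):
            if zlib.crc32(phrase.encode("utf-8")) % passes == current:
                bucket[phrase] += 1
        # Buckets are disjoint, so merging never overwrites a count.
        # A real tool would flush each bucket to disk instead.
        totals.update(bucket)
    return totals


sample = "black cumin seed oil and black cumin seed powder"
assert rip_in_passes(sample, 4) == rip_in_passes(sample, 1)
print(rip_in_passes(sample, 4)["black_cumin"])  # 2
```

The trade-off is the one stated above: more passes mean less memory per pass but more time spent rereading the input, while a single pass needs memory for every distinct phrase at once.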




Well, let us see what all the words to the right of 'nigella' are:










[`nigella_*] 0,000,573 nigella_lawson /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,185 nigella_sativa /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,051 nigella_damascena /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,040 nigella_lt /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,040 nigella_bites /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,039 nigella_saunders /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,036 nigella_s /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,031 nigella_seeds /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,024 nigella_express /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,017 nigella_lawsons /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,015 nigella_title /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,014 nigella_seed /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,014 nigella_feasts /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,013 nigella_is /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,012 nigella_k /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,011 nigella_binomial /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,009 nigella_arvensis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,008 nigella_and /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,008 nigella_amp /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,006 nigella_quot /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,006 nigella_nigella /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,006 nigella_lucy /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,006 nigella_gt /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,006 nigella_bittleston /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,005 nigella_in /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_work /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_was /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_revision /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_kitchen /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_http /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_hms /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_has /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_category /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,004 nigella_biotech /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_text /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_orientalis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_or /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_hillgarth /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_detail /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_christmas /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_week /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_url /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_twd /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_tree /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_tells /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_species /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_serves /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_revealed /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_regnum /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_recovery /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_present /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_photographer /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_mason /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_l /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_kitchenbbc /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_joins /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_intervention /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_hispanica /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_genus /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_excess /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_episodes /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_elliptio /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_e /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_disambiguation /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_didnt /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_comment /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_ciliaris /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_by /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_also /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_wins /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_will /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_u /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_type /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_tv /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_tops /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_to /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_the /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_that /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_tastes /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_subspecies /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_subsequently /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_stir /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_sp /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_sold /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_smaller /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_sexy /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_sends /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_selected /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_segetalis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_searchtv /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_ryan /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_rusa /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_retseptid /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_recipe /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_quote /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_quick /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_puri /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_profile /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_phyllostachys /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_papillosa /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_over /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_oregano /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_ononis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_nutmeg /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_nipponanthemum /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_nigellastrum /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_mordellistena /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_missing /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_meshnumber /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_meets /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_may /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_m /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_lowland /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_look /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_long /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_linnavuori /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_let /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_leaves /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_last /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_kolanji /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_kalonji /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_jekyll /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_itv /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_interview /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_integrifolia /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_image /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_iambia /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_https /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_his /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_hampson /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_glandulifera /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_gets /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_gaiman /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_gabby /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_from /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_fr /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_fox /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_fighting /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_female /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_fails /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_explains /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_except /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_episode /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_epioblasma /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_ekspress /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_effect /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_dyspyralis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_during /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_drew /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_dishes /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_date /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_d /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_cretica /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_cooked /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_claims /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_carom /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_bff /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_belmont /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_authorlink /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_au /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_article /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_arcyria /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_ampsud /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_align /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_admits /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,001 nigella_a /enwiki-20170101-pages-articles.xml.02.txt/

I see at least eight species beyond the aforementioned four: 'nigella_arvensis', 'nigella_hispanica', 'nigella_segetalis', 'nigella_phyllostachys', 'nigella_papillosa', 'nigella_ononis', 'nigella_dyspyralis', 'nigella_arcyria', ...

[`nigella_*] 0,000,185 nigella_sativa /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,051 nigella_damascena /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,003 nigella_orientalis /enwiki-20170101-pages-articles.xml.02.txt/
[`nigella_*] 0,000,002 nigella_ciliaris /enwiki-20170101-pages-articles.xml.02.txt/
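
Result lines like the ones above share one visible layout: the wildcard query in brackets, a zero-padded count with comma separators, the matched 2-gram, and the source file between slashes. A minimal Python sketch for reading such lines into tuples could look like this; the name `parse_hit` and the regular expression are assumptions inferred from the listing, not part of the Kazahana searcher.

```python
import re

# Layout inferred from the listing above (an assumption, not a documented format):
# [`query] count ngram /source-file/
HIT = re.compile(
    r"^\[`(?P<query>[^\]]+)\]\s+(?P<count>[\d,]+)\s+(?P<ngram>\S+)\s+/(?P<source>[^/]+)/$"
)

def parse_hit(line):
    """Return (query, count, ngram, source), or None if the line does not match."""
    m = HIT.match(line.strip())
    if m is None:
        return None
    # Strip the comma separators; int() accepts the leading zeros.
    return (m["query"], int(m["count"].replace(",", "")), m["ngram"], m["source"])

lines = [
    "[`nigella_*] 0,000,185 nigella_sativa /enwiki-20170101-pages-articles.xml.02.txt/",
    "[`nigella_*] 0,000,051 nigella_damascena /enwiki-20170101-pages-articles.xml.02.txt/",
]
for query, count, ngram, source in (parse_hit(l) for l in lines):
    print(count, ngram)
```

With the counts as integers, sorting or filtering the hits (for example, keeping only 2-grams seen more than once) becomes a one-liner.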

Ugh, Wikipedia doesn't cover all the Nigella species; more appear at:
Source: https://species.wikimedia.org/wiki/Nigella

Nigella elata
Nigella fumariifolia
Nigella glandulifera
Nigella lancifolia
Nigella oxypetala
Nigella stellaris
Nigella turcica

Another mania arises: to list all the Nigella species, I have to rip and add the:
Source: https://dumps.wikimedia.org/specieswiki/20170701/
specieswiki-20170701-pages-articles.xml (942,541,119 bytes)

Looking at what stands to the left of 'nigella' in the 2-grams, some already designate insects, such as 'megachile_nigella'. I tried to find a picture of it, but to no avail; could anyone provide one? Maybe this bee followed the Dodo, or just no one cared to get a picture of it.

"The genus Megachile is a cosmopolitan group of solitary bees, often called leafcutter bees or leafcutting bees. While other genera within the family Megachilidae may chew leaves or petals into fragments to build their nests, certain species within Megachile neatly cut pieces of leaves or petals, hence their common name. This is one of the largest genera of bees, with almost 1500 species[1] in over 50 subgenera."
Source: https://en.wikipedia.org/wiki/Megachile

Only 1500 'Megachile' 2-grams, yippee.

Oh, love these tiers:

Taxonomic Hierarchy

Kingdom Animalia – Animal, animaux, animals
Subkingdom Bilateria
Infrakingdom Protostomia
Superphylum Ecdysozoa
Phylum Arthropoda – Artrópode, arthropodes, arthropods
Subphylum Hexapoda – hexapods
Class Insecta – insects, hexapoda, inseto, insectes
Subclass Pterygota – insects ailés, winged insects
Infraclass Neoptera – modern, wing-folding insects
Superorder Holometabola
Order Hymenoptera – abelha, formiga, vespa, ants, bees, wasps
Suborder Apocrita – abeilles, fourmis, guêpes véritables, narrow-waisted hymenopterans, ants, bees, true wasps
Infraorder Aculeata
Superfamily Apoidea – bees, sphecoid wasps, apoid wasps
Family Megachilidae – leafcutting bees
Subfamily Megachilinae
Tribe Megachilini
Genus Megachile Latreille, 1802
Species Megachile nigella Vachal, 1908

Source: ITIS, the Integrated Taxonomic Information System: https://www.itis.gov/

Crazy! Who can list all the bees in a ... beelist?
"There are nearly 20,000 known species of bees in seven recognized biological families."
https://en.wikipedia.org/wiki/Bee

As the cliché goes, there is more than meets the eye...

Should you encounter a problem, just ask me at sanmayce@sanmayce.com, gladly will ... assist. Enfun!

Ah, and one particular thing that makes me happy, AMD with their upcoming (two weeks away) desktop monster 'Threadripper' CPU will pair breathtakingly with the multi-threaded searcher Kazahana used in the above package - 16 threads executed by 16 cores - all exact/wildcard/fuzzy queries will fire on all cylinders.

He learns not to learn and reverts to what all men pass by.
Sanmayce
Posted: Friday, July 14, 2017 5:38:42 AM

Rank: Advanced Member

Joined: 5/29/2012
Posts: 331
Neurons: 15,702
Location: Sofia, Sofia-Capital, Bulgaria
Reuploaded the package (Masakari revision 7) because it appeared non-downloadable via some browsers (due to GoogleDrive dependencies, cookies and whatnot); also made it smaller by using the most aggressive compression options, thus the 6GB became:

_GW.7z 4.86 GB (5,218,801,759 bytes):
or the direct link
https://drive.google.com/file/d/0BzKgu_YpO6uZZUVkTDBvdVhhMjg/view?usp=sharing

dotNetFx40_Full_x86_x64.zip:
or the direct link
https://drive.google.com/file/d/0BzKgu_YpO6uZTUVXQWRYVHRNMEk/view?usp=sharing

The second file is needed on some bare Windows installations where dotNet extension is not pre-installed.

To remind what the spirit of Masakari project is:

"The animistic tradition from ancient times state that deities descend to and reside in the mountains. For lumbermen, the mountain was therefore a sacred territory which required strict ritual abstentions to be entered. The ax has been closely related with this religious revering of the mountain and its trees. For example, the first act amongst the myriad of Shinto rituals carried out before the lumbering for the rebuilding of the Ise Shrine every 20 years, is the cutting into a tree with a ritually purified ax (imi-ono). Moreover in the festival of the pillar (Onbashira-matsuri) at the Suwa shrine, a vermillion-lacquered ax is used to cut down a tree which is to become the sacred pillar.
In Buddhist symbolism the ax also acquires the power of cutting off evil, and there are numerous existing statues of bodhisattva holding axes. Shugen-do, a traditional Japanese religion born out of an amalgam of different religions including Shintoism and Buddhism which has a particular connection with mountains, regards the ax as one of the symbolic objects to be carried by practitioners when going into mountains for ascetic training."


In my eyes, Masakari - The Free Phrase-Checker, comes to cut off big slices of ignorance in English wording/phrasing. As Japanese say 'purified ax' I would extend the thought to 'purified ax purifying' or just 'purificator', wow, this word is already heavily loaded with similar notion:
"PURIFICATOR - A small piece of white linen, marked with a cross in the center, used by the priest in the celebration of Mass."

Simply, the package is intended to purify phrasing within English phraseology.

My immediate intention is to rip the 3-grams from Wikipedia as an add-on to the above package. I had to buy the $69.99 OneDrive upgrade to gain 1000GB storage; then the first 3-gram Wikipedia phrase-checker will be freely available to everyone. Having it will allow you to compare your texts with Wikipedia's "style" and see the rankings of each phrase. And the insanity doesn't stop there: after completing the Gamera Corpus r.28 enrichment, the craze will leave the milliondom and enter billiondom, that is, transition from 9-digit to 10-digit corpora of 3-grams. With such a corpus, the user will be able to answer how a given phrase is ranked within the tri-x-gram-billiondom.
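
To make the idea of ripping and ranking 3-grams concrete, here is a minimal Python sketch. It assumes lowercase a-z words joined by underscores (the style of the listings in this thread); the function names are hypothetical, and the real ripper works on multi-gigabyte dumps rather than on one string in memory.

```python
import re
from collections import Counter

def trigrams(text):
    """Yield underscore-joined 3-grams of lowercase a-z words."""
    words = re.findall(r"[a-z]+", text.lower())
    for i in range(len(words) - 2):
        yield "_".join(words[i:i + 3])

def rank_phrases(corpus_text):
    """Count every 3-gram; Counter.most_common() then gives the ranking."""
    return Counter(trigrams(corpus_text))

sample = "She is also very impressionable. She is also very determined."
counts = rank_phrases(sample)
print(counts.most_common(3))
```

Note the simplification: the sketch lets 3-grams run across sentence boundaries, so 'very_impressionable_she' is counted too. Checking a phrase from your own text is then just a lookup such as `counts["she_is_also"]`.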

Sanmayce
Posted: Saturday, September 16, 2017 9:34:41 AM

Finished building English Wikipedia phrase-check corpus.
Slowly but surely have been reaching for a final package allowing Phrase-Checking against the whole English Wikipedia, meaning, using its phrases as "checkbase".

After all the final preparations I will do a walkthrough; I want to show how beautiful phrasings such as 'very_impressionable' and 'Kazehana_s_mannerisms' can be spotted and enrich one's vocabulary.
"Kusano (草野, Sekirei #108), commonly referred to as "Kuu-chan" or "Ku", is the youngest of Minato's Sekirei, and is also known as the "Green Girl" (緑の少女 Midori no Shōjo) by other Sekirei. At the beginning of the story, she was hiding in a botanical garden after being traumatized when Mikogami attempted to forcibly wing her. Kusano communicated with Minato telepathically and led him through the garden until he found her. Kusano refers to Minato as Onii-chan (big brother), and is the most attached to him. She does not like fighting or quarreling and she can be seen stopping them when they start. She is also very impressionable and often copies Musubi, Tsukiumi and Kazehana's mannerisms. She is extremely determined to be Minato's wife when she grows up, and is highly possessive of him at times, ..."
An excerpt from https://en.wikipedia.org/wiki/List_of_Sekirei_characters

The corpus that will be used as checkbase.



Also, added the needed buttons (feature of revision 4). The purple buttons on the bottom-right will check the file selected (the inverse darkblue line):





Wikimedia Foundation, Inc. should rethink the usefulness of their search options.
After finishing the package I will upload it somewhere; if you have 50GB shareable on your Internet drive, please cooperate by uploading it to your drive.

Sanmayce
Posted: Friday, January 19, 2018 3:55:19 AM

Wanna share my view on the pronunciation aspect, thought about creating a separate thread, however the common ground with this master-thread is enough, so here something unseen on Internet or bookstores comes...

Transcriptiondom - Deep Dive into American English Transcriptions

My attempt to cover this fundamental topic follows.
First off, the approach taken herein is based entirely on RHW (Random House Webster), love it.

This is the superb RHWUD:



The idea behind this thread is to exhaustively write down all the phonemes (American English) in use.
So feel free to jump in and share your view on the matter, bug-fixing is highly appreciated.

I've seen many dictionaries' legends (transcription schemes) showing all kinds of special symbols, and never liked the inconsistency and the lack of uniformity; my core idea is to address and solve this issue by offering one simplified set of symbols - all writable with the present 26 English letters.
In ancient times, pre-Sanskrit, there was a mystical law/rule that the written language should correspond with the spoken language directly, without the need of "recoding" i.e. transcriptions; in my eyes, the need for such a mediator shows the deviation from this beautiful and sacred principle. Nowadays, English is especially "sick" in that department, as opposed to my native Bulgarian, where we still (mostly) preserve the unity of what-is-spoken-is-what-is-written. Therefore I consider my language superior in that regard.
So, using the Random House Webster transcription scheme as a backbone and seeing the need for a reference name, here comes RHWSPA, which stands for Random-House-Webster-Sanmayce-Phonetic-Alphabet. IPA, the International Phonetic Alphabet, suffers from the inability to "encode" phonemes with vanilla ASCII symbols. On top of that, I wanted American (not British) variants all the way.

My passion, or rather obsession, was harnessed, and after some 4-5 months of grabbing the original electronic edition of RHW the outcome was 67,466 word-transcription pairs. Since my kidhood my perception of knowledge has been in the spirit of all-for-the-people; I thought that the world was a place/workshop for constant enrichment where the goal was simply to appreciate. It turned out the pimpish way was imposed on many activities, using the pay-me-these-many-cents-to-give-you-these-many-letters formula. The result is shameful: 7 billion people (2 billion of whom use English) have no free resource/reference available to look up the basics of English - the spoken part - utter shame! What planet is this!?
I can hear pimps on left and right saying "this is the business model, get over it", guess I am the last of the fools. Or, to sugarcoat it, the last of the freedomists.



A page from the corpus:



Right on, right into the list, later my quick remarks will be given.

Legend:
{} - a phoneme enclosed in curly brackets
Primary Stress {'} follows the stressed syllable
~ - tilde sign precedes the vowel which becomes shortened, a diacritic in fact

English alphabet:

a|({e~i}), n., pl. A's or As, a's or as.
b|(b{i:}), n., pl. B's or Bs, b's or bs.
c|(s{i:}), n., pl. C's or Cs, c's or cs.
d|(d{i:}), n., pl. D's or Ds, d's or ds.
e|({i:}), n., pl. E's or Es, e's or es.
f|(ef), n., pl. F's or Fs, f's or fs.
g|(j{i:}), n., pl. G's or Gs, g's or gs.
h|({e~i}ch), n., pl. H's or Hs, h's or hs.
i|({a~i}), n., pl. I's or Is, i's or is.
j|(j{e~i}), n., pl. J's or Js, j's or js.
k|(k{e~i}), n., pl. K's or Ks, k's or ks.
l|(el), n., pl. L's or Ls, l's or ls.
m|(em), n., pl. M's or Ms, m's or ms.
n|(en), n., pl. N's or Ns, n's or ns.
o|({o~u}), n., pl. O's or Os; o's or os or oes.
p|(p{i:}), n., pl. P's or Ps, p's or ps.
q|(ky{u:}), n., pl. Q's or Qs, q's or qs.
r|({a:}r), n., pl. R's or Rs, r's or rs.
s|(es), n., pl. S's
t|(t{i:}), n., pl. T's
u|(y{u:}), n., pl. U's or Us, u's or us.
v|(v{i:}), n., pl. V's or Vs, v's or vs.
w|(dub{'}{x}l y{u:}{"}, -y{~ou}; rapidly dub{'}y{x}), n., pl. W's or Ws, w's or ws.
x|(eks), n., pl. X's or Xs, x's or xs.
y|(w{a~i}), n., pl. Y's
z|(z{i:} or, esp. Brit., zed; Archaic iz{'}{x}rd), n., pl. Z's or Zs, z's or zs.

Random House Webster Dictionary Transcription Phonemes:

a reads {e~e} or {~ae} when primary stressed
abacist|(ab{'}{x} sist), n.
man|(man), n., pl. men,
glad|(glad), adj., gladder, gladdest, v., gladded, gladding.

{e:x} reads {e~e}{x}
air|({e:x}r), n.
area|({e:x}r{'}{i:} {x}), n.
aware|({x} w{e:x}r{'}), adj.

{x} reads {x}
ago|({x} g{o~u}{'}), adj.
away|({x} w{e~i}{'}), adv.
awesome|({o:}{'}s{x}m), adj.

{x_} reads {x} or omit
inspired|(in sp{a~i}{x_}rd{'}), adj.
aspire|({x} sp{a~i}{x_}r{'}), v.i., -pired, -piring.
require|(ri kw{a~i}{x_}r{'}), v., -quired, -quiring.

{x:} reads {x~x}
murder|(m{x:}r{'}d{x}r), n.
absurd|(ab s{x:}rd{'}, -z{x:}rd{'}), adj.
adverb|(ad{'}v{x:}rb), n. Gram.

{u:} reads {u~u}
tool|(t{u:}l), n.
attitude|(at{'}i t{u:}d{"}, -ty{u:}d{"}), n.
avenue|(av{'}{x} ny{u:}{"}, -n{u:}{"}), n.

{a:} reads {a~a}
car|(k{a:}r), n.
fast|(fast, f{a:}st), adj., -er, -est, adv., -er, -est, n.
abaca|(ab{"}{x} k{a:}{'}, {a:}{"}b{x}-), n.

{i:} reads {i~i}
meet|(m{i:}t), v., met, meeting, n.
abbreviated|({x} br{i:}{'}v{i:} {e~i}{"}tid), adj.
linear|(lin{'}{i:} {x}r), adj.

{o:} reads {o~o}
all|({o:}l), adj.
almost|({o:}l{'}m{o~u}st, {o:}l m{o~u}st{'}), adv.
also|({o:}l{'}s{o~u}), adv.
assault|({x} s{o:}lt{'}), n.
automatic|({o:}{"}t{x} mat{'}ik), adj.

{~ou} reads {~ou}
look|(l{~ou}k), v.i.
pull|(p{~ou}l), v.t.
cure|(ky{~ou}r), n., v., cured, curing.
book|(b{~ou}k), n.
fury|(fy{~ou}r{'}{i:}), n., pl. -ries.
sure|(sh{~ou}r, sh{x:}r), adj., surer, surest, adv.

{a~i} reads {a~i}
ice|({a~i}s), n., v., iced, icing, adj.
lionheart|(l{a~i}{'}{x}n h{a:}rt{"}), n.
lively|(l{a~i}v{'}l{i:}), adj., -lier, -liest, adv.

{o~u} reads {o~u}
tone|(t{o~u}n), n., v., toned, toning.
adore|({x} d{o:}r{'}, {x} d{o~u}r{'}), v., adored, adoring.
loaded|(l{o~u}{'}did), adj.

{e~i} reads {e~i}
ape|({e~i}p), n., v., aped, aping.
able|({e~i}{'}b{x}l), adj., abler, ablest, n.
daresay|(d{e:x}r{'}s{e~i}{'}), v.i., v.t.

{dh} reads {dd}
then|({dh}en), adv.
although|({o:}l {dh}{o~u}{'}), conj.
another|({x} nu{dh}{'}{x}r), adj.

th reads {tt}
think|(thingk), v., thought, thinking, adj., n.
truth|(tr{u:}th), n., pl. truths (tr{u:}{dh}z, tr{u:}ths).
strength|(strengkth, strength, strenth), n.

ng reads {nn}
drinking|(dring{'}king), adj.
bethink|(bi thingk{'}), v., -thought, -thinking.
function|(fungk{'}sh{x}n), n.
sing|(sing), v., sang or, often, sung; sung; singing; n.

u reads {~xa}
sun|(sun), n., v., sunned, sunning.
uncut|(un kut{'}), adj.
underdog|(un{'}d{x}r d{o:}g{"}, -dog{"}), n.

ou reads {a~u}
out|(out), adv.
about|({x} bout{'}), prep.
empower|(em pou{'}{x}r), v.t.

oi reads {o~i}
voice|(vois), n., v., voiced, voicing, adj.
avoid|({x} void{'}), v.t.
tomboy|(tom{'}boi{"}), n.

y reads {~i}
yes|(yes), adv., n., pl. yeses, v., yessed, yessing, interj.
strenuous|(stren{'}y{u:} {x}s), adj.
year|(y{i:}r), n.

w reads {~u}
wow|(wou), Informal.
would|(w{~ou}d; unstressed w{x}d), v. would (w{o~u}ld), n.
twist|(twist), v.t.

j reads {dj}
jester|(jes{'}t{x}r), n.
justice|(jus{'}tis), n.
badge|(baj), n., v., badged, badging.

zh reads {j}
revision|(ri vizh{'}{x}n), n.
displeasure|(dis plezh{'}{x}r), n., v., -ured, -uring.
version|(v{x:}r{'}zh{x}n, -sh{x}n), n.

ch reads {tch}
check|(chek), v., n., pl. checks or, for 45, chex, adj., interj.
church|(ch{x:}rch), n.
dispatch|(di spach{'}), v.t.

sh reads {sch}
thrash|(thrash), v.t.
accomplished|({x} kom{'}plisht), adj.
chef|(shef), n.

{kh} reads {kh}
claught|(kl{o:}{kh}t, kl{a:}{kh}t), v.
loch|(lok, lo{kh}), n. Scot.
reich|(r{a~i}k; Ger. R{a~i}{kh}), n.

Yesterday I liked the directness of one Korean fellow; his attempt to show the phonemic approach instantly caught my attention, since this is the right way.

https://www.youtube.com/watch?v=4l2kXoNIzJE

His example "cats and dogs" is transcribed in Random-House Webster style as:
k{e~e}ts {e~e}nd d{o~o}gs
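
The "reads" table above can be applied mechanically. Below is a minimal Python sketch of such a converter, from a Random House Webster style transcription to the RHWSPA reading. The function name `to_rhwspa` is hypothetical, the input spellings `kats`, `and` and `d{o:}gs` are assumed RHW forms, and the sketch always picks {e~e} for plain `a`, ignoring the alternative {~ae} reading.

```python
import re

# Readings copied from the table above; symbols that read as themselves
# ({x}, {~ou}, {a~i}, {o~u}, {e~i}, {kh}, stress marks) pass through unchanged.
BRACED = {
    "{e:x}": "{e~e}{x}", "{x_}": "{x}", "{x:}": "{x~x}", "{u:}": "{u~u}",
    "{a:}": "{a~a}", "{i:}": "{i~i}", "{o:}": "{o~o}", "{dh}": "{dd}",
}
PLAIN = {
    "a": "{e~e}", "th": "{tt}", "ng": "{nn}", "u": "{~xa}", "ou": "{a~u}",
    "oi": "{o~i}", "y": "{~i}", "w": "{~u}", "j": "{dj}", "zh": "{j}",
    "ch": "{tch}", "sh": "{sch}",
}
# Braced symbols first, then two-letter symbols, then any single character.
TOKEN = re.compile(r"\{[^{}]*\}|th|ng|zh|ch|sh|ou|oi|.")

def to_rhwspa(rhw):
    """Rewrite an RHW-style transcription using the readings listed above."""
    out = []
    for tok in TOKEN.findall(rhw):
        table = BRACED if tok.startswith("{") else PLAIN
        out.append(table.get(tok, tok))
    return "".join(out)

print(to_rhwspa("kats and d{o:}gs"))
print(to_rhwspa("m{x:}r{'}d{x}r"))
```

Two-letter symbols are matched before single letters so that `th` or `ou` is not read as two separate phonemes; letters that merely sit next to each other inside one syllable (an `n` followed by a `g`, say) would need extra care in a real tool.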

My question: can you show an English word having a phoneme outside the ones listed?

Oh, and the precious corpus itself: https://twitter.com/Sanmayce/status/954266258677092352

Sanmayce
Posted: Saturday, March 17, 2018 3:56:25 AM

Hold on to your hat, here comes enlightening the richest English tagged-wordlist...



After undergoing major enrichment, in the form of 30 more corpora and tagging each word with the respective abbreviation of the corpus' name (and the occurrences within this corpus), the package is downloadable from one of my Internet drives:
_GW_Schizandrafield_Corpus_revision_C.7z 1.69 GB (1,820,740,465 bytes)
Schizandrafield.pdf 712 KB (729,272 bytes)

A quick walk-through for finding words formed with the -en suffix and the en- prefix: I wanted to know how en-light-en resonates with en-height-en. It turns out that one of the corpora - the Google Books corpus (dump from 2013), featuring 7+ million distinct words - doesn't have the word 'en-height-en'; it appears in a book (screenshot #2) from the Google Books project, but after 2013. This word appears only once in the 2+TB JSON dump of reddit.com, so two independent sources "validate" it.

-en
suffix forming verbs
cause to be; become; cause to have: blacken; heighten.
[Old English -n-, as in fæst-n-ian to fasten, of common Germanic origin; compare Icelandic fastna]
https://www.thefreedictionary.com/-en

Prefix
en-
1. in, into, on, onto
2. covered
3. caused
4. as an intensifier
https://en.wiktionary.org/wiki/en-

What would be the meaning of enheighten?

These 'en*en' words deserve further investigation, they have the potential to en-strength-en i.e. to draw the attention by double-highlighting or double-highlightening.

Step #1: Select the file you need to search into:



Step #2: Enter your wildcards and pattern at top, then Single-click on the button 'Sensitive search with wildcards':



Step #3: Double-click on the resultant file 'Kazahana.txt':
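
Outside the GUI, the three steps above boil down to a case-sensitive wildcard filter over a wordlist. A minimal Python sketch using the standard `fnmatch` module follows; the tiny wordlist is a hypothetical stand-in for the selected corpus file, and Kazahana's own wildcard syntax may differ from fnmatch's (which also gives `?` and `[...]` special meaning).

```python
from fnmatch import fnmatchcase

def wildcard_search(words, pattern):
    """Case-sensitive wildcard match: * is any run of characters, ? is one character."""
    return [w for w in words if fnmatchcase(w, pattern)]

# Hypothetical stand-in wordlist; the real lookup runs over the selected corpus file.
words = ["enlighten", "enheighten", "enliven", "heighten", "Enlighten", "enter"]
print(wildcard_search(words, "en*en"))
```

Here 'heighten' is rejected because it lacks the en- prefix, 'Enlighten' because the match is case-sensitive, and 'enter' because it lacks the -en suffix.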



The whole idea is for wordsmiths to have one master vocabulary repository serving as a look-up sidekick/assistant; whenever an uncommon word interests me, looking it up in the tagged revision C is to give ... its usage across major diverse vocabulary fields.

Jyrkkä Jätkä
Posted: Saturday, March 17, 2018 7:56:25 AM

Rank: Advanced Member

Joined: 9/21/2009
Posts: 41,647
Neurons: 393,585
Location: Helsinki, Southern Finland Province, Finland
I just wonder when you start examining the Finnish vocabulary ;-)


In the beginning there was nothing, which exploded.
Sanmayce
Posted: Saturday, March 17, 2018 4:14:17 PM

Jyrkkä Jätkä wrote:
I just wonder when you start examining the Finnish vocabulary ;-)


Hee-hee, nice catch; not having all Latin-script languages is unacceptable - to be addressed in revision D. The things that stopped me from exhaustively covering them are the desire to have/use only a-z when forming English words and the initial choice to rip only the Latin subset of ASCII coding. Expanding to UTF-8 is a possibility; then diacritics will no longer truncate the original words into only-A-to-Z sequences.

Currently, your name Jyrkkä Jätkä is ripped as three words jyrkk/j/tk instead of two.
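
The effect is easy to reproduce. The following Python sketch assumes the ripper treats every character outside a-z as a separator (my reading of the description above, not the actual ripper code) and contrasts it with a Unicode-aware letter class.

```python
import re

name = "Jyrkk\u00e4 J\u00e4tk\u00e4"  # "Jyrkkä Jätkä", with precomposed ä (U+00E4)

# a-z only: every other character acts as a separator, so ä splits the words.
ascii_words = re.findall(r"[a-z]+", name.lower())

# Unicode-aware alternative: [^\W\d_] matches any letter, ä included.
unicode_words = re.findall(r"[^\W\d_]+", name.lower())

print(ascii_words)    # ['jyrkk', 'j', 'tk']
print(unicode_words)  # ['jyrkkä', 'jätkä']
```

The second pattern only works as shown when the text has been decoded to a Unicode string first; on raw bytes the letter ä is still two unrelated byte values.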

My wish is, when I enter writing mode, to have a fulcrum (that is the name of one superb Russian air interceptor/supporter), a master heavy-duty vocabulary wordlist; too often words play games of their own, and I want to play too by intercepting their transitions.

leonAzul
Posted: Wednesday, March 21, 2018 4:33:49 AM

Sanmayce wrote:
Jyrkkä Jätkä wrote:
I just wonder when you start examining the Finnish vocabulary ;-)


Hee-hee, nice catch; not having all Latin-script languages is unacceptable - to be addressed in revision D. The things that stopped me from exhaustively covering them are the desire to have/use only a-z when forming English words and the initial choice to rip only the Latin subset of ASCII coding. Expanding to UTF-8 is a possibility; then diacritics will no longer truncate the original words into only-A-to-Z sequences.

Currently, your name Jyrkkä Jätkä is ripped as three words jyrkk/j/tk instead of two.

My wish is, when I enter writing mode, to have a fulcrum (that is the name of one superb Russian air interceptor/supporter), a master heavy-duty vocabulary wordlist; too often words play games of their own, and I want to play too by intercepting their transitions.


Unicode is almost acceptably complete. UTF-8 covers European scripts compactly, and both UTF-8 and UTF-16 can handle the most salient Asian, African, and Native American scripts. What makes the UTF-XX text encodings so worthwhile is that they clear up all the murkiness of the ISO variants of ASCII that haunted the 1990s.
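
For anyone wanting to see the encodings side by side: UTF-8 and UTF-16 both encode the same Unicode character set and differ only in how many bytes each character takes. A small Python check, with sample characters of my own choosing, could look like this:

```python
samples = {
    "a": "a",               # plain ASCII letter
    "a-umlaut": "\u00e4",   # ä, as in Jätkä
    "kanji": "\u8349",      # 草, as in 草野
}

sizes = {}
for label, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no byte-order mark
    # Both encodings round-trip to the same character.
    assert utf8.decode("utf-8") == ch and utf16.decode("utf-16-le") == ch
    sizes[label] = (len(utf8), len(utf16))
    print(label, len(utf8), len(utf16))
```

ASCII letters stay at one byte in UTF-8, which is why existing a-z wordlists are already valid UTF-8 files.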



Sanmayce
Posted: Wednesday, March 21, 2018 7:29:00 AM

Thanks for the hint. I once got scared while looking into the different UTF schemes; it is hard for me to see the whole picture. Maybe somebody else could share how to rip all words in all languages; currently I am satisfied with a-z words. Lacking German lexicons in their entirety makes me unhappy, though. Yet the way I see a master English unigram corpus, such as Schizandrafield, is still within the 26 English letters, sticking to the best-known English pangram:
"The quick brown fox jumps over the lazy dog."
a b c d e f g h i j k l m n o p q r s t u v w x y z


Really like the Germanic boldness, especially in forming long words, for example 'generalfeldmarschall', pure beauty! Only the next 4 rascals/scamps break the sequencing:
"The letters in the German alphabet are the same as in English; however, there are four more letters which you will come across in the German language: ä, ö, ü and ß. However, these extra four letters are not part of the alphabet. Once you are familiar with the pronunciation of the German language you will find German can be spoken quite smoothly without using too much spit and harsh, abrupt endings!"

leonAzul, you are welcome to see my attempt to share (to return the favor in a way) with reddit.com community, FREE package: Richest English 1-gram corpus + Fastest Searcher.

To me, word formations (or rather geneses) have so much in common with genetics; the mixing and morphing of their sequences is amazing, whether the alphabet is GACT or abcdefghijklmnopqrstuvwxyz. For example, the blend of the words/animals whale + dolphin gives wholphin. Strange, too many people fail to see the connection - the dynamism in such "permutations":



The beauty:
"Second-generation wolphin female "Kawili Kai" (*23 December 2004) at Sea Life Park Hawaii, 9 months old. Offspring of first-generation wolphin female "Kekaimalu" (parents: "Tanui Hahai" Pseudorca crassidens ♂ x "Punahele" Tursiops truncatus ♀) and a male bottlenose dolphin (Tursiops truncatus)."

As far as I know, 'Kekaimalu' translates from Hawaiian as "From the Peaceful Ocean"?! Love such translations.

There are so many such pearls in the Hawaiian language; all should be incorporated into a master English vocabulary, no doubt about it.

...

KEIKILANI f Hawaiian
Means "heavenly child" or "royal child" from Hawaiian keiki "child" and lani "heaven, sky". This name was popular in Hawaii from 2000-2005.

KEILANI f Hawaiian
This name means "glorious sky" or "glorious heaven" from kei meaning "dignified, proud, glorious" and lani meaning "sky, heaven, heavenly, spiritual, royal, exalted, noble, aristocratic."

KEKAI m Hawaiian
This name means "the sea" from ke, which is a definite article, and kai meaning "sea, sea water."

KEKAPUHILIHINAPOHULANI f & m Hawaiian
The winding pathway to Heaven.

KEKINO m Hawaiian
Means "strength" in Hawaiian.

...


Source: https://www.behindthename.com/submit/names/usage/hawaiian


He learns not to learn and reverts to what all men pass by.
leonAzul
Posted: Wednesday, March 21, 2018 2:28:40 PM

Rank: Advanced Member

Joined: 8/11/2011
Posts: 8,349
Neurons: 26,527
Location: Miami, Florida, United States
Sanmayce wrote:

Really like the Germanic boldness, especially in forming long words, for example 'generalfeldmarschall', pure beauty! Only the following 4 rascals/scamps break the sequencing:
"The letters in the German alphabet are the same as in English; however, there are four more letters which you will come across in the German language: ä, ö, ü and ß. However, these extra four letters are not part of the alphabet. Once you are familiar with the pronunciation of the German language you will find German can be spoken quite smoothly without using too much spit and harsh, abrupt endings!"



I would respectfully disagree with the notion that these are not part of the German alphabet, yet native speakers like IMcRout would be more expert on that issue. Each of these letters does have a standard Latin equivalent: "ae", "oe", "ue", "ss". Also there are slight differences in orthography in different German-speaking lands, with Austria notably preferring the use of hyphens to separate the individual components of longer concatenations.
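As a rough illustration of the Latin equivalents mentioned above, here is a minimal Python sketch that folds ä, ö, ü and ß to "ae", "oe", "ue" and "ss" before ripping a-z words out of a text. The names GERMAN_FOLD and az_words are made up for this example; this is not how Sanmayce's corpus tools work, only one possible way to keep German words inside the 26 letters.

```python
import re
import unicodedata

# Conventional ASCII spellings for the four extra German letters.
GERMAN_FOLD = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def az_words(text: str) -> list[str]:
    """Return the maximal runs of a-z after lowercasing and folding."""
    # NFC composes "a" + combining diaeresis into a single "ä" first.
    composed = unicodedata.normalize("NFC", text)
    folded = composed.lower().translate(GERMAN_FOLD)
    # Anything outside a-z (digits, punctuation, other scripts) splits words.
    return re.findall(r"[a-z]+", folded)

if __name__ == "__main__":
    print(az_words("Der Generalfeldmarschall mag Äpfel und die Straße."))
```

Letters from other alphabets are still treated as separators here, so this covers only the German case discussed in the post.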

Just the same, I note that you have arrived at a place that I predicted several years ago: the need to account for native patterns of composition when evaluating the usefulness of a particular permutation of letters as a viable word. Your statistical approach is not wrong, yet in my humble opinion you could achieve better utility by filtering the results of your n-grams based on observed patterns, thus weeding out the combinations that most native speakers would find nonsensical or identify as spelling errors.
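One very basic way to filter by observed patterns, sketched here in Python under my own assumptions: collect the letter trigrams (with start and end markers) seen in a trusted lexicon, then keep only those candidates whose trigrams have all been seen before. The function names and the tiny lexicon are invented for illustration; a real filter would need a large reference word list and probably frequency thresholds rather than a strict yes/no test.

```python
def trigrams(word: str) -> set[str]:
    """Letter trigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word}$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def learn_patterns(lexicon) -> set[str]:
    """Union of all trigrams observed in a trusted word list."""
    seen: set[str] = set()
    for word in lexicon:
        seen |= trigrams(word)
    return seen

def plausible(word: str, patterns: set[str]) -> bool:
    """True if every trigram of the candidate was observed in the lexicon."""
    return trigrams(word) <= patterns

if __name__ == "__main__":
    patterns = learn_patterns(["whale", "whole", "dolphin"])
    for candidate in ["wholphin", "whlpohin"]:
        print(candidate, plausible(candidate, patterns))
```

With this toy lexicon the blend "wholphin" passes because each of its letter sequences already occurs in "whole" or "dolphin", while the scrambled "whlpohin" is rejected.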



"Make it go away, Mrs Whatsit," he whispered. "Make it go away. It's evil."
Sanmayce
Posted: Wednesday, March 21, 2018 3:54:07 PM

Rank: Advanced Member

Joined: 5/29/2012
Posts: 331
Neurons: 15,702
Location: Sofia, Sofia-Capital, Bulgaria
>Just the same, I note that you have arrived at a place that I predicted several years ago: the need to account for native patterns of composition when evaluating the usefulness of a particular permutation of letters as a viable word. Your statistical approach is not wrong, yet in my humble opinion you could achieve better utility by filtering the results of your n-grams based on observed patterns, thus weeding out the combinations that most native speakers would find nonsensical or identify as spelling errors.

Yes, I remember. For far too long I have stayed in the statistical-report-only stage; I want to move forward and achieve better utility by applying some ranking system. But I differentiate the two: stats could never be a wrong approach, while ranking is another opera. It could go endlessly deep into Machine-Learning systems and be "filtered" by some super specialist. I am not on that level. My wish is to exhaust the most basic approaches first.

He learns not to learn and reverts to what all men pass by.
Sanmayce
Posted: Wednesday, March 28, 2018 10:54:13 AM

Rank: Advanced Member

Joined: 5/29/2012
Posts: 331
Neurons: 15,702
Location: Sofia, Sofia-Capital, Bulgaria
While contemplating the beautiful techniques of comparing two sets of texts, the need for a simplistic, but powerful, utility came to mind. So, here comes Kamboocha - The Plagiarism Detector

Why such a name? For one, it has a similar vibe to the colloquialism ‘Gotcha’ for ‘got you’.

In case you want to find the longest common chunk/phrase in two etexts, the tool is downloadable from my Google Drive as the Benchmark_Mickey_VS_Mike_(Kamboocha_Intel_64).zip file, or from:
Intel Developer Zone Forum

After looking for similarities in two biographical ebooks (Mike_Tyson_-_Undisputed_Truth_-_My_Autobiography and Mickey_Rourke_-_Wrestling_With_Demons), in the end the reported common string/phrase is:
wanted to be the center of attention.

The process of creating the matrix housing all common suffixes is maximally asymmetrical, supersimple yet superstrong: once it is precomputed, one can find matches of any order from 1 up to the LCSS, i.e. the reported maximal value. FULL CONTROL, in two words. Love it. But the price is salty: the RAM footprint is beyond crazy. See the screenshot, where I intend to kamboochaify two major resources/texts of Judaism:



This etude holds huge potential for analyses of English texts/phrases.
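For readers who want to play with the idea, here is a minimal Python sketch of the textbook common-suffix dynamic programming behind a longest common substring search. It is not the Kamboocha source, and it makes a different trade-off: it keeps only two rows of the matrix, which saves RAM but returns just the single longest match instead of matches of every order. The function name is my own.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous chunk shared by a and b (earliest one in a on ties)."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    # prev[j] = length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix that ended one character earlier.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]

if __name__ == "__main__":
    print(longest_common_substring("kamboocha", "gotcha"))
```

The full matrix needs one cell per pair of positions, so two texts of a few megabytes each already mean trillions of cells, which would explain a RAM footprint like the one described above. The running time of this sketch is still proportional to len(a) * len(b).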

He learns not to learn and reverts to what all men pass by.
Sanmayce
Posted: Saturday, March 31, 2018 4:21:04 PM

Rank: Advanced Member

Joined: 5/29/2012
Posts: 331
Neurons: 15,702
Location: Sofia, Sofia-Capital, Bulgaria
Just found a superb etude dedicated to Plagiarism.

Published on Mar 23, 2018
Two students (Vocab Malone and Jon McCray) are caught by their philosophy professor (David Wood) copying a Wikipedia article for their papers. Can atheism help them avoid charges of plagiarism and imminent disciplinary action? Let's find out!

https://www.youtube.com/watch?v=yto4jXOOen8

...

Dr. Wood:
Hmm, I guess you have a point. If the Universe is a product of chance, I just have no basis for accusing you of plagiarism.

The hatman #1:
Thank you.

Dr. Wood:
Gentlemen, I'm afraid I owe you an apology. Not only do I withdraw my accusation, I hereby declare that no matter how similar your future papers are to Wikipedia articles, I can never accuse you of cheating, because the odds of you writing the exact same words that a Wikipedia author came up with, no matter how many times it happens, are far better than the odds of the Universe being able to support life by chance. If I can believe that the Universe formed without a designer, I can believe absolutely anything.

The hatman #1:
Apology accepted.


Love it! Would transcribe the whole masterpiece, but my eyes are tired.
On the plagiarism note, the Kamboocha tool has been multi-threaded and should be made even faster soon.

He learns not to learn and reverts to what all men pass by.
Copyright © 2008-2018 Farlex, Inc. All rights reserved.