{\rtf1\ansi\ansicpg1256\deff0\deflang1031{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2509;}\viewkind4\uc1\pard\sa200\sl276\slmult1\qc\lang7\f0\fs36 kbal - knubbze's babbling algorithm\par
\pard\li720\sa200\sl276\slmult1\i\fs24 09:52:17 <@shirin> a still more striking example is nietzsches zarathustra where the author himself observed how one became two. \par
\pard\li720\sa200\sl276\slmult1\qr - sample output of a bot using \b kbal\b0 , it has been fed with mostly bullshit though \i0\par
\pard\sa200\sl276\slmult1 First off, I am not writing this document as some kind of scientific documentation on the kbal; this is my own set of guidelines that I used and am still using to perfect the pybabble bot. A lot of seemingly scientific terminology may be considered anything from apt to downright inaccurate or even complete garbage.\par
If you are interested in a direct application (and perhaps, in terms of completeness and coherence, a more scientific approach to the algorithm), have a look at the source code of pybabble.\par
It should also be said that this is not an actual algorithm for an intelligently talking artificial intelligence, but for one that babbles intelligible nonsense. \par
\b [Section 1: Terminology]\par
\b0 Throughout this document I will be referring to some basic data structures with a set of defined terms, those being: \i List\i0 , \i Reference\i0 , \i Dictionary\i0 , \i Queue\i0 and \i Set\i0 .\par
A \i List \i0 is a data structure that is akin to what is generally referred to as a \b\i vector\i0 \b0 in the world of computer science and information technology. It is a numerically indexed field of data. \par
A \i Reference\i0 is a data structure that is akin to a \b\i pair\b0\i0 structure. Unlike a pointer, which is a unidirectional reference in the actual sense, a \i Reference\i0 in the name space of \i kbal\i0 is a direct bond between two objects.\par
A \i Dictionary\i0 is a data structure that combines \b\i associative arrays\b0\i0 , \i References \i0 and \i Lists\i0 . The contents of said dictionary may for instance be an arbitrarily indexed list of references of lists. This might sound complicated, but it will become easier to grasp as a concept later on.\par
The \i Queue\i0 type is basically the same as a list, with the difference that in theory it should not be iterable, and no random access is granted except for \b\i reading \b0\i0 the \i front\i0 of the queue and \b\i writing\b0\i0 to the \i back\i0 of the queue. Queues are just a mathematical or logical construct (\i gedankenexperiment\i0 , as it were) and are conveniently implemented as lists.\par
And last but not least, the \i Set\i0 is a dictionary of dictionaries. That's all there is to it really.\par
This was about everything that was necessary to explain the rudimentary data structures that are going to be employed in an implementation of the \b\i kbal\b0\i0 . Let's get more specific about the terms: \i Word\i0 , \i Sentence\i0 , \i Paragraph, Thesaurus \i0 and \i Thesauri\i0 .\par
First, a \i Word\i0 is always defined as a \i Reference\i0 which refers to itself as well as to the \i Sentence\i0 it belongs to. It is thus also the \i atom\i0 of a \i Paragraph.\i0 Please note that it is entirely possible for a \i Word\i0 in the realm of kbal to be \b\i two or more\i0 \b0 words in the real-life sense.\i\par
\i0 A \i Sentence\i0 as opposed to a \i Word\i0 is defined as a \i List\i0 of \i Words\i0 . A Sentence may, \b or may not \i (!!)\i0 \b0 be part of a \i Paragraph\i0 .\par
The \i Paragraph\i0 is a \i Dictionary\i0 of \i References\i0 of \i Sentences.\i0\par
A \i Thesaurus\i0 is a \i Set\i0 of \i Paragraphs\i0 , and \i Thesauri\i0 are \i Sets\i0 of \i Thesauruses\i0 \i [sic]\i0 .\par
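The terms above can be pictured with a few Python types. This is only a rough sketch and not taken from pybabble: the class names, the make_sentence helper and the dictionary keys are made up for illustration, and the nesting of Paragraph, Thesaurus and Thesauri is one possible reading of the definitions.\par

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Sentence:
    # A Sentence is a List of Words.
    words: List["Word"] = field(default_factory=list)


@dataclass(eq=False)
class Word:
    # A Word is a Reference: it pairs its own text with the Sentence it belongs to.
    text: str
    sentence: Optional[Sentence] = None


def make_sentence(tokens):
    # Hypothetical helper: build a Sentence and bind every Word back to it.
    sentence = Sentence()
    for token in tokens:
        sentence.words.append(Word(token, sentence))
    return sentence


# Assumed nesting: a Paragraph maps an arbitrary key to Sentences,
# a Thesaurus is a dictionary of Paragraphs, Thesauri a dictionary of Thesauruses.
Paragraph = Dict[str, List[Sentence]]
Thesaurus = Dict[str, Paragraph]
Thesauri = Dict[str, Thesaurus]

# One kbal Word may hold two real-life words, such as "uncle Jack".
sentence = make_sentence(["uncle Jack", "helped", "me"])
paragraph = dict(greeting=[sentence])
thesaurus = dict(smalltalk=paragraph)
thesauri = dict(nouns=thesaurus)
```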
\b [Section 2: Lexical Analysis]\par
\b0 The lexical analysis is split into two subsections, namely the \b\i generation\b0\i0 of \i Paragraphs\i0 and the \b\i analysis\b0\i0 of \i Sentences\i0 .\par
For starters, I shall focus on the \b\i analysis\b0\i0 of \i Sentences\i0 , as without a reasonably large Thesaurus \b generation\b0 is impossible. \par
\pard\li720\sa200\sl276\slmult1\b [Section 2.1: Analysis]\par
\b0 At this point it is assumed that the input was sanitized. Input may be a word, a sentence, or a paragraph. In case it is a paragraph, the analysis is supposed to happen using a \i Queue \i0 of \i Sentences\i0 . Beautifying (read: making safely parseable) the input is not the task of the \b kbal\b0 . \par
This is a quick overview of the criteria that a sentence should meet:\par
\pard\li1440\sa200\sl276\slmult1 - The final token must be used to differentiate between a \b query\b0 and a \b statement\b0 . Thus \b ?\b0 and \b !\b0 are always to be the last token. If the input came without any punctuation, we must use \b heuristics\b0 to determine which of the two it is (a possible approach would be to match the syntax of the sentence against known templates for both questions and statements, apply \b\i fuzzy logic [1]\b0\i0 to it, and hope for the best).\par
- Numbers, IPs, and other worthless data should be dismissed but \b can\b0 be tied to the preceding \i Word\i0 by a reference. Implementation of such \i Obscures\i0 is entirely at the mercy of whoever is implementing \b kbal\b0 .\par
- Capitalization must not be compromised by sanitization or beautification, otherwise bad things can happen, as the \b kbal\b0 aspires to know the difference between \b I helped my uncle Jack off a horse\b0 and \b I helped my uncle jack off a horse\b0 .\par
- If a sentence is divided by commas, the approach would be to re-arrange the entire sentence so that commas are no longer required; the division happens automatically using a derivation of the \b kbal\b0 and the result will usually no longer be grammatically correct, but that won't matter to the parser.\par
\pard\li720\sa200\sl276\slmult1 Parsing a sentence matching all of the above criteria is \b\i trivial\b0\i0 , if you keep in mind the governing structure of \i Thesauri \lang1031\i0\u8594? \i Thesaurus \i0\u8594? \i Paragraph \i0\u8594? \i Sentence\i0 \u8594? \i Word\i0 .\par
\lang7\b\fs22 [1]\b0 : If you have never heard of anything called \b fuzzy logic\b0 , now would be a good time to brush up your knowledge on it by reading the respective \i Wikipedia\i0 article. The application of fuzzy logic in this particular scenario should be trivial if you know that you can (=should) store \b\i Initializers\b0\i0 , \b\i Verbs,\b0\i0 \b\i Pronouns\b0\i0 , and \b\i Nouns\b0\i0 in separate \i Thesauri\i0 .\par
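A minimal sketch of the analysis criteria from Section 2.1, assuming whitespace-separated input. The function name analyse, the OBSCURE pattern and the choice to treat everything except a trailing question mark as a statement are my own assumptions; an "unknown" result marks the spot where the heuristics would have to take over.\par

```python
import re

# Hypothetical pattern for worthless tokens: plain numbers and dotted, IP-like strings.
OBSCURE = re.compile("^[0-9]+([.][0-9]+)*$")


def analyse(sentence):
    # Split on whitespace only, so capitalization is never touched.
    tokens = sentence.split()
    kind = "unknown"
    if tokens and tokens[-1][-1] in "?!.":
        last = tokens[-1]
        # Assumption: only a question mark makes a query.
        kind = "query" if last[-1] == "?" else "statement"
        tokens[-1] = last[:-1]
    # Drop empty leftovers and obscure tokens such as numbers and IPs.
    words = [token for token in tokens if token and not OBSCURE.match(token)]
    return kind, words
```

For example, analyse("I helped my uncle Jack off a horse") keeps the capital J and reports the kind as "unknown" because the input carries no punctuation.\par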
\b\fs24 [Section 2.2: Generation]\par
\pard\li1440\sa200\sl276\slmult1 [Section 2.2.1: True Random]\b0\par
The generation of sentences can, and \b must\b0 , be divided into another two sub-categories, namely \b true random\b0 and \b related\b0 generation. A true random generated sentence is a sentence that had no seed to begin with. Thus it can be a question or a statement, and should feed off known \b\i Initializers\b0\i0 , \b\i Verbs\b0\i0 , \b\i Pronouns\b0\i0 and \b\i Nouns\b0\i0 . The application of a choice-algorithm is therefore necessary if you want to keep some kind of coherency. In order to \b\i avoid\b0\i0 sentences that seemingly stop at random points, a score table matching nouns to their proximity to the end of a human generated input should be kept, and \b fuzzy logic [2.1:1]\b0 should be applied.\par
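The score table could be sketched roughly as follows. Note that this uses a plain weighted random choice in place of fuzzy logic, and that the fixed template (Initializer, Pronoun, Verb, Noun), the function names and the 0.01 weight floor are assumptions made only for this example.\par

```python
import random
from collections import defaultdict

# Score table: noun -> accumulated closeness to the end of human input.
end_scores = defaultdict(float)


def learn_end_scores(words, nouns):
    # A score of 1.0 means the word closed the input, values near 0 mean it opened it.
    total = len(words)
    for position, word in enumerate(words, start=1):
        if word in nouns:
            end_scores[word] += position / total


def random_sentence(initializers, pronouns, verbs, nouns, rng):
    # Assumed template: Initializer, Pronoun, Verb, Noun.
    candidates = sorted(nouns)
    # Small floor so that nouns never seen near the end stay possible.
    weights = [end_scores[noun] + 0.01 for noun in candidates]
    closing = rng.choices(candidates, weights=weights, k=1)[0]
    return [rng.choice(initializers), rng.choice(pronouns), rng.choice(verbs), closing]
```

Nouns that humans tend to end their sentences with are picked more often as the closing word, which is one way to keep a sentence from stopping at a seemingly random point.\par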
\b [Section 2.2.2: Related]\par
\b0 The generation of related sentences is where the \b\i kbal\b0\i0 gets interesting and where the actual maths begins. The seed that we work with should be entirely \b stripped\b0 of unknown words and unknown structures, and \b must\b0 be known to be either a question or a statement. Fuzzy logic as per 2.1 \b /MUST NOT/ \b0 be applied to the seed at this stage, for the simple reason that an application of fuzzy logic would most likely yield a fifty-fifty chance of being either (if the \b\i kbal\b0 \i0 has been implemented correctly).\par
How the generation of a random related answer works (indented blocks are their mathematical representation): \par
\b LET\b0 Every \b\i Word\b0\i0 of \b\i Seed\b0\i0 be an \i Element\i0 of one or more \b\i Thesauri\b0\i0 .\par
\pard\li2160\sa200\sl276\slmult1 S \lang1031\u8494? \{Th1, Th2, ..., ThN\} = true\par
\lang7\line\pard\li1440\sa200\sl276\slmult1\b LET\b0 TrueSeed be a \b\i List\b0\i0 of \b every permutation\b0 of \i Seed\i0 .\par
\tab T = \lang1031\u8747?(\u8710?S)\par
\b LET\b0 TrueSeed be \b unambiguously \b0 matchable to \b one specific Thesaurus\b0 .\par
\pard\li2160\sa200\sl276\slmult1 T \u8494? Th = true\par
\pard\li1440\sa200\sl276\slmult1\b FOR \b0 Every \i T \i0 find a matching \i Reference\i0 from \i Th\i0\par
\pard\li2160\sa200\sl276\slmult1 Ref[index] = \u8721?[Th, T](T \u8494? Th)\lang7\par
\pard\li1440\sa200\sl276\slmult1\b FOR\b0 Every \i Ref\i0 find a matching \i Reference\i0 in \b any\b0 Thesaurus\par
\tab RefRef[Index] \lang1031\u8494? \{Th1, Th2, ..., ThN\} = true\par
\b LET \b0 The Initializer be found from any \i Ref\i0 for T of Th from the Thesaurus I\par
\pard\li2160\sa200\sl276\slmult1 I = RAND\{ \i Ref\i0 \u8494? ThInit \}\par
\pard\li1440\sa200\sl276\slmult1\b LET \b0 Sentence be an empty \i queue\i0 of \i words\line\b\i0 PUSH\i \b0\i0 The Initializer into the Sentence \line\line\tab S = \{ I, \}\lang7\par
\b FOR\b0 Every \i RefRef\i0 from I find another \i Ref\i0 in any Thesaurus and \b PUSH\b0 into the Sentence\par
\tab S = \{ S, \lang1031\u8721?[RefRef \u8494? ThInit, Ref \u8494? \{Th1, Th2, ..., ThN\}](I of RefRef) \}\par
\b FOR \b0 Every S that may also be found in Ref, find a Reference and replace\par
\pard\li2160\sa200\sl276\slmult1 S = \lang7 \lang1031\u8747?(\u8710?(\u8721?[S[i], Ref](S[i] \u8494? \{ Ref \}))\par
\pard\li1440\sa200\sl276\slmult1\b REPEAT\b0 The former step until no more \{ RefRef, \} \u8494? \{ Ref, \} is true\par
\b Formula:\par
\tab\lang7\b0 S \lang1031\u8494? \{Th1, Th2, ..., ThN\} = true\line\tab\lang7 T = \lang1031\u8747?(\u8710?S)\line\tab T \u8494? Th = true\line\tab Ref[index] = \u8721?[Th, T](T \u8494? Th)\line\tab\lang7 RefRef[Index] \lang1031\u8494? \{Th1, Th2, ..., ThN\} = true\line\tab I = RAND\{ \i Ref\i0 \u8494? ThInit \}\line\tab S = \{ I, \}\line\lang7\tab S = \{ S, \lang1031\u8721?[RefRef \u8494? ThInit, Ref \u8494? \{Th1, Th2, ..., ThN\}](I of RefRef) \}\line\tab S = \lang7 \lang1031\u8747?(\u8710?(\u8721?[S[i], Ref](S[i] \u8494? \{ Ref \}))\par
It is of course perfectly trivial to append the punctuation (\b ?\b0 , \b !\b0 , \b .\b0 ) because we know that: T \u8494? Th. Storing of this might be mandatory depending on your implementation of Ref[index] = \u8721?[Th, T](T \u8494? Th).\par
Everything else is of course perfectly obvious. A usual loop would then look like this:\par
- \b Gather \b0 input\line - \b Learn\b0 input (see analysis)\line - \b Process\b0 input (see above)\line - \b If\b0 every condition during Processing was true, then:\line\tab\b Return \b0 Sentence\par
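The loop above can be sketched in a few lines of Python. This is a heavily simplified stand-in and not the kbal itself: instead of Thesauri and References it only remembers which word followed which, and the class name Babbler, its methods and the length limit are all hypothetical.\par

```python
import random


class Babbler:
    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.followers = dict()     # word -> list of words seen directly after it
        self.initializers = list()  # words seen at the start of an input

    def learn(self, words):
        if not words:
            return
        self.initializers.append(words[0])
        for current, following in zip(words, words[1:]):
            self.followers.setdefault(current, list()).append(following)
        # Make sure the closing word is known as well.
        self.followers.setdefault(words[-1], list())

    def process(self, seed, limit=12):
        # Every word of the seed must be known, otherwise a condition failed.
        if not seed or any(word not in self.followers for word in seed):
            return None
        # Prefer an Initializer that is related to the seed.
        related = [word for word in self.initializers if word in seed]
        word = self.rng.choice(related or self.initializers)
        sentence = [word]
        while self.followers[word] and len(sentence) < limit:
            word = self.rng.choice(self.followers[word])
            sentence.append(word)
        return sentence

    def step(self, line):
        words = line.split()        # Gather
        self.learn(words)           # Learn
        return self.process(words)  # Process, None if a condition failed
```

Calling step once per incoming line runs Gather, Learn and Process in that order; a return value of None stands for the case where not every condition during processing was true.\par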
\par
\pard\sa200\sl276\slmult1 For the \b kbal\b0 to work it is \b absolutely\b0 crucial that it never be fed with nonsense, if you want to keep its responses coherent. The results of being fed with nonsense can look like this:\par
\pard\li720\sa200\sl276\slmult1\i 21:27:28 <@shirin> Combo: was consciously planned and with material that was consciously selected we find that it agrees with the first class of qualities and in the other case with the second. geht drauf konzentriert juden zu ver[REDACTED]n davon gibts sowieso viel zu viele lol voll der lustige ding nee ich bin schon in rente. \par
\pard\sa200\sl276\slmult1\i0 And that's all there is to it! It's simple, but it works. \par
(C) 2010, C. Kiewiet, <knubbze@xin.lu>\line Contact me under: irc.arabs.ps, #arab if you have any questions\par
}