Abstract: Trace the origin, appreciate the methods, and admire the transcendence.
Chapter 1 · Text, Numbers, Language, Information
Similarity and Consistency
-
[Principle] Numbers, words, and natural language are all carriers of information, and they are naturally connected. Like written words, numbers began purely as tools for carrying information and had no abstract meaning. It was not until 1948, when Claude Shannon proposed information theory, that mathematics and information systems were consciously linked.
-
[Clustering] In ancient Egyptian hieroglyphs, words with the same pronunciation could be recorded with the same symbol. This grouping is similar in principle to clustering in today's natural language processing and machine learning.
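The grouping idea can be sketched as follows. This is only an illustration of collecting items under a shared key, the simplest form of clustering; the `PRONUNCIATION` table and the function name are hypothetical and use English words rather than hieroglyphs.

```python
from collections import defaultdict

# Hypothetical table (an assumption for illustration): word -> rough pronunciation key.
PRONUNCIATION = {
    "sun": "SUHN",
    "son": "SUHN",
    "sea": "SEE",
    "see": "SEE",
    "tree": "TREE",
}

def cluster_by_pronunciation(table):
    """Group words that share a pronunciation key into one cluster."""
    clusters = defaultdict(list)
    for word, sound in table.items():
        clusters[sound].append(word)
    return dict(clusters)

clusters = cluster_by_pronunciation(PRONUNCIATION)
```

Each key in `clusters` plays the role of one shared symbol. Real clustering algorithms group by measured similarity rather than an exact key match, but the outcome has the same shape: many items, fewer groups.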
-
[Disambiguation] Clustering words by meaning inevitably introduces some ambiguity. In most cases, polysemous words can be disambiguated by using their context.
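A minimal sketch of context-based disambiguation, assuming a tiny hand-written sense inventory (`SENSES` and its cue words are invented for this example): the sense whose cue words overlap most with the surrounding sentence wins.

```python
# Hypothetical sense inventory: each sense of "bank" lists a few cue words.
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "account", "deposit"},
        "river edge": {"river", "water", "fishing", "shore"},
    }
}

def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence."""
    context = set(sentence.lower().split())
    candidates = senses[word]
    # On a tie, max() keeps the first sense listed.
    return max(candidates, key=lambda sense: len(candidates[sense] & context))

sense_a = disambiguate("bank", "she opened an account at the bank to deposit money")
sense_b = disambiguate("bank", "we went fishing on the river bank")
```

Here `sense_a` resolves to the financial sense and `sense_b` to the river sense. Practical systems learn such cues from a corpus rather than listing them by hand, which is also why disambiguation works in most cases but not all.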
-
[Translation] Translation is possible simply because different writing systems are equivalent in their ability to record information. (This conclusion is very important.) Furthermore, writing is only a carrier of information, not the information itself.
-
[Corpus] The decipherment of the Rosetta Stone offers two guiding principles: 1. Redundancy of information safeguards information security; 2. Language data, which we call a corpus, and especially a bilingual or multilingual parallel corpus, is crucial for translation and serves as the foundation of research in machine translation.
-
[Communication] In communication, if the channel is wide, information can be directly transmitted without compression; but if the channel is narrow, the information needs to be compressed as much as possible before transmission, and then decompressed at the receiving end.
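The compress, transmit, decompress pattern can be sketched with Python's standard `zlib` module. The `send` and `receive` helpers and the `channel_is_narrow` flag are hypothetical names chosen for this example.

```python
import zlib

def send(message: bytes, channel_is_narrow: bool) -> bytes:
    """Compress before transmission only when the channel is narrow."""
    return zlib.compress(message) if channel_is_narrow else message

def receive(payload: bytes, channel_is_narrow: bool) -> bytes:
    """Undo whatever the sender did for this channel."""
    return zlib.decompress(payload) if channel_is_narrow else payload

# A repetitive message compresses well because it is highly redundant.
message = b"the quick brown fox " * 50
payload = send(message, channel_is_narrow=True)
restored = receive(payload, channel_is_narrow=True)
```

Both ends must agree on whether compression is in use, just as both parties in a conversation must share the same conventions for abbreviated speech.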
-
[Checksum] Jewish scribes, who copied the Bible with great care, invented a method similar to the checksums used in today's computers and communication systems. They assigned a number to each Hebrew letter, so that the letters in each line summed to a specific number, which served as the checksum for that line. A copied line whose sum did not match the original revealed a copying error.
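A minimal sketch of the per-line letter-sum check. The letter values here (a=1 through z=26 for Latin letters) are an assumption made for illustration and are not the values the scribes assigned to Hebrew letters.

```python
def letter_value(ch):
    """Assumed mapping for illustration: a=1 ... z=26, anything else counts as 0."""
    ch = ch.lower()
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0

def line_checksum(line):
    """Sum the values of all letters in one line."""
    return sum(letter_value(ch) for ch in line)

def verify(line, expected):
    """Return True when a copied line still matches the recorded checksum."""
    return line_checksum(line) == expected

original = "in the beginning"
recorded = line_checksum(original)

faithful_copy = "in the beginning"
faulty_copy = "in the begining"  # one letter dropped during copying

ok = verify(faithful_copy, recorded)
bad = verify(faulty_copy, recorded)
```

A plain sum catches dropped or altered letters but not two letters swapped within a line, since addition ignores order.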
-
[Grammar] Shakespeare's works were thoroughly popular and mainstream in his time, yet they contain a large number of sentences that violate classical grammar. People at the time attempted to improve (in effect, tamper with) his plays. However, his original language has not disappeared; it became classic, while those who tried to improve his works have long been forgotten. This raises a question of method in linguistic research: which is right, the language or the grammar? The achievements of natural language processing ultimately declare victory for the former.