Leveraging Computational Linguistics to Use Somatic Idioms for Tracing the Velocity of Spreading Ideas


Ideas are abstract concepts, and the spread of ideas can be treated as a flow of information. Idioms are small ideas. According to Arnold Zwicky, adjunct professor of linguistics at Stanford University, new idioms come along all the time. Because most idioms can be traced to an origin and new idioms appear continuously, these small abstract concepts can serve as a flow tracer for analyzing the velocity at which ideas spread.

Language is a mental skill that develops throughout the life of an individual. This developmental process can be examined using a number of techniques, one of which is computational. For this reason, the study draws on computational linguistics, an interdisciplinary field. Moreover, the approach can be adapted to conduct research within a particular speech community and over a specific span of time.

In this study, two dates are identified for every idiom under review. The first date pinpoints the first printed occurrence according to Google Books, found via Google Books Ngram Viewer. The second date pinpoints the earliest handwritten or printed occurrence in sources other than Google Books. The velocity is then calculated from the interval between these two dates and analyzed using big data.

Question / Proposal

First, this study aims to show that idioms can be used for tracing the velocity of spreading ideas.

Next, this research is focused on the quantitative assessment and analysis of the velocity of spreading ideas.

Also, a cross-check is performed by matching the influence of historical events on language development and vice versa.

In addition, this study aims to show that difficult scientific research can be done by a school student on a tight budget.

Finally, this scientific study can serve as a proven tool for archaeolinguists and extend other known research approaches.


Nova Spivack, a leading pioneer in semantic web technology, developed an approach to measuring the physical properties of ideas as they move in real time through information spaces and populations such as the Internet. His approach is based on applying basic concepts from classical physics to the measurement of ideas, or what are often called memes, as they move through information spaces over time.

My scientific study goes along with the ideas of Nova Spivack, but with the following differences:

  • In addition to the English text corpora of Google Books, online references to manuscripts, handwritten books, and sources other than Google Books are used.
  • My study examines the influence of the appearance of mass-produced written matter, whereas Spivack's study, being generic in nature, establishes no correlation with any historical event.
  • Spivack uses memes, whereas my study uses idiomatic expressions, a sub-class of memes. This sub-class is easier to trace by conventional approaches.
  • Spivack's work uses Google but not Google Books Ngram Viewer, which my study uses to simplify the analysis.
  • My study maps ideas to waves, which is more relevant, rather than to particles, as Spivack originally suggested.

All in all, this scientific study can help archaeolinguists match the influence of historical events on language development and vice versa. It can also extend Spivack's approach by applying the hybrid method to simplify calculations and improve accuracy. On the other hand, the developed approach requires manual analysis, which may not be acceptable in some cases.

Method / Testing and Redesign

To avoid time-consuming library searches among rare Medieval and Renaissance books when identifying the earliest recorded occurrence, Google Books Ngram Viewer is used.

The viewer is an online search engine that charts the frequencies of any set of comma-delimited search strings (e.g., "be all ears", "best foot forward", "head in the clouds") using a yearly count of n-grams found in the English text corpora of Google Books, which comprise 25 million books printed between 1500 and 2008. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech.
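
The n-gram definition can be illustrated with a minimal sketch. The function name `ngrams` and the simple whitespace tokenization are assumptions made for illustration only; the tokenization used by Google Books is not described here and is likely more elaborate.

```python
def ngrams(text, n):
    """Return every contiguous sequence of n words in text.

    Assumption: words are lowercased and split on whitespace only.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The idiom "be all ears" is a single three-gram inside a longer sentence.
trigrams = ngrams("She was ready to be all ears", 3)
print(("be", "all", "ears") in trigrams)
```

A three-word idiom therefore corresponds to exactly one three-gram, which is the unit whose yearly count the viewer charts.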

For example, this online computational linguistics tool is used to automatically analyze a three-gram idiom. Google Books Ngram Viewer presents a chart that rises after the 1920s and has an overall upward trend.

Also, because “someone’s” can be replaced by a pronoun (his, her, their, etc.), an advanced search should be used in which “someone’s” is replaced by “*_PRON”. The resulting chart starts in the 1800s and has trended downward since the 1940s.
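
The substitution can be sketched as a small string helper. The names `PLACEHOLDER`, `PRONOUNS`, `wildcard_query`, and `expanded_queries` are hypothetical, the pronoun list is illustrative rather than exhaustive, and a straight apostrophe is assumed in the placeholder.

```python
PLACEHOLDER = "someone's"  # assumption: straight apostrophe
# Illustrative, not exhaustive, list of possessive pronouns.
PRONOUNS = ["his", "her", "their", "my", "your", "our"]


def wildcard_query(idiom):
    """Replace the placeholder with the wildcard form quoted in the text."""
    return idiom.replace(PLACEHOLDER, "*_PRON")


def expanded_queries(idiom):
    """Build one explicit query per pronoun as a comma-delimited string."""
    return ",".join(idiom.replace(PLACEHOLDER, p) for p in PRONOUNS)


print(wildcard_query("pulling someone's leg"))
print(expanded_queries("pulling someone's leg"))
```

The wildcard form keeps the search to a single string, while the expanded form makes each pronoun variant visible as its own line on the chart.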

Next, based on these results, Google Books itself is searched, which narrows the results down to a few dozen sources.

Next, the Google Books search results are manually reviewed to distinguish idioms from the literal meaning of the idiom's individual elements. 

In this way, the first printed occurrence according to Google Books is found. For the “pulling someone’s leg” idiom, it is Bithia Mary Croker, Diana Barrington: A Romance of Central India, 1887.

Furthermore, to save time when identifying the earliest handwritten or printed appearance of the idioms in sources other than Google Books, multiple online sources such as Theidioms.com and Dictionary.com are used.

For example, according to Gary Martin, Ph.D., founder of Phrases.org.uk, the “pulling someone’s leg” idiom was first printed in The Newark Daily Advocate in February of 1883.

As a result, two historical dates for the twenty-five somatic idioms under review are found. 

Here, the time difference between the first handwritten or printed appearance of an idiom in sources other than Google Books and its first printed occurrence according to Google Books is calculated (Download…)

The difference for the “pull someone’s leg” idiom is four years.
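
The arithmetic behind this figure can be sketched as follows. The class name `IdiomDates` and its field names are hypothetical; only the two years for “pull someone’s leg” come from the text.

```python
from dataclasses import dataclass


@dataclass
class IdiomDates:
    idiom: str
    first_other_source: int  # year of earliest occurrence outside Google Books
    first_google_books: int  # year of first printed occurrence per Google Books

    def lag_years(self):
        """Years between the earliest known occurrence and the Google Books one."""
        return self.first_google_books - self.first_other_source


leg = IdiomDates("pull someone's leg", first_other_source=1883, first_google_books=1887)
print(leg.lag_years())  # 4, the difference stated in the text
```

Working at year granularity is an assumption of this sketch; the month given for the 1883 newspaper occurrence is ignored.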

The time difference for all the reviewed idioms is shown on this chart. The trend line indicates that the time difference contracted from hundreds of years in the Old English period to a negligibly short period nowadays.

The date of the first handwritten or printed appearance of an idiom in sources other than Google Books and the date of its first printed occurrence according to Google Books are represented as a pulse front and a pulse drop, respectively.

Finally, the number of impulse transitions per second on the transmission medium is calculated (Download…). All the calculations are made in accordance with the International System of Units.
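
The text does not give the formula, so the following is only a sketch under a stated assumption: the interval between the two dates is treated as one pulse, and the rate is taken as the reciprocal of the pulse duration in seconds, expressed in nanobaud (nBd). The function name `velocity_nbd` and the use of a 365.25-day year are also assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: Julian year


def velocity_nbd(lag_years):
    """Assumed model: one pulse per idiom, rate = 1 / pulse duration, in nBd."""
    if lag_years <= 0:
        raise ValueError("lag must be positive to define a rate")
    baud = 1.0 / (lag_years * SECONDS_PER_YEAR)
    return baud * 1e9  # 1 Bd = 1e9 nBd


print(round(velocity_nbd(4), 2))    # four-year lag of "pull someone's leg"
print(round(velocity_nbd(300), 2))  # a lag of a few centuries
```

Under this assumption, a lag of about 300 years gives roughly 0.1 nBd and a four-year lag gives roughly 7.9 nBd. These values are of the same order as the nBd figures reported for the chart, but that does not confirm that this is the formula actually used.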


The calculated velocity for all the reviewed idioms is shown on this chart. The wavy trend line indicates that the velocity went up and down over the span of a number of years.

Johannes Gutenberg developed a printing system circa 1439. The chart shows a minor rise, up to 0.5 nBd, for this period.

Furthermore, as printing spread, governments established controls over printers across Europe, requiring them to have official licenses to trade and produce books. The grip of censorship had weakened by the end of the 18th century. The trend line stays flat at about 0.1 nBd from the 1500s to the 1800s.

Next, Friedrich Koenig invented a high-speed steam-powered printing press in 1814. The chart shows a steep rise after this year.

To sum up, because the obtained results coincide with the historical events examined, the suggested approach can be used for tracing the velocity of spreading ideas.


The research results clearly show the influence of historical events, such as the invention of the first printing press circa 1439 and of the high-speed steam-powered printing press in 1814, on the velocity of spreading ideas.

However, some somatic idioms cannot be traced to their roots because they are heavily used in both literal and figurative senses.

Also, the research approach used here requires refining the data by filtering the manually processed results to separate manuscripts from printed books.

As a result, the accuracy and quality of the experimental results can be improved further by considering a representative sample of all idioms and by manually separating the first handwritten mention from the first printed mention.


Bibliography, references, and acknowledgements


