Final Version of our investigations into:

Report on dangers and opportunities posed by large search engines, particularly Google

September 30, 2007

H. Maurer, co-author, editor and responsible for the project, Institute for Information Systems and Computer Media, Graz University of Technology

Co-authors in alphabetical order:
Dipl. Ing. Dr. Tilo Balke, L3S Hannover
Professor Dr. Frank Kappe, TU Graz
Dipl. Ing. Narayanan Kulathuramaiyer, TU Graz
Priv. Dozent Dr. Stefan Weber, Uni Salzburg
Dipl. Ing. Bilal Zaka, TU Graz

We want to acknowledge input from Professor Dr. Nick Scerbakov and Univ. Doz. Dr. Denis Helic, both TU Graz, and we want to thank Mrs. M. Lampl for her infinite patience with various versions of this work and for formatting it into a somewhat coherent document.

We want to particularly thank the Austrian Federal Ministry of Transport, Innovation and Technology (Bundesministerium für Verkehr, Innovation und Technologie) for the financial support of this report.

Table of Contents

Overview
Executive Summary: The Power of Google and other Search engines
Introduction: A guide how to read this document
Section 1: Data Knowledge in the Google Galaxy - and Empirical Evidence of the Google-Wikipedia Connection
Section 2: Google not only as main door to reality, but also to the Google Copy Paste Syndrome: A new cultural technique and its socio-cultural implications
Section 3: The new paradigm of plagiarism - and the changing concept of intellectual property
Section 4: Human and computer-based ways to detect plagiarism - and the urgent need for a new tool
Section 5: Design of a pilot project for advanced plagiarism detection
Section 6: Different kinds of plagiarism: How to avoid them, rather than to detect them
Section 7: Dangers posed by Google, other search engines and developments of Web 2.0
Section 8: Emerging Data Mining Applications: Advantages and Threats
Section 9: Feeble European attempts to fight Google while Google's strength is growing
Section 10: Minimizing the Impact of Google on Plagiarism and IPR Violation tools
Section 11: Reducing the influence of Google and other large commercial search engines by building 50-100 specialized ones in Europe
Section 12: Market Forces vs. Public Control of Basic Vital Services
Section 13: The three-dimensional Web and "Second Life"
Section 14: Service oriented information supply model for knowledge workers
Section 15: Establishing an Austrian Portal
Section 16: References for Sections 1-5
Appendix 1: A Comparison of Plagiarism Detection Tools
Appendix 2: Plagiarism - a problem and how to fight it
Appendix 3: Why is Fighting Plagiarism and IPR Violation Suddenly of Paramount Importance?
Appendix 4: Restricting the View and Connecting the Dots - Dangers of a Web Search Engine Monopoly
Appendix 5: Data Mining is becoming Extremely Powerful, but Dangerous
Appendix 6: Emerging Data Mining Applications: Advantages and Threats
List of Figures

Overview

The aim of our investigation was to discuss exactly what is formulated in the title. This will of course constitute a main part of this write-up. However, in the process of the investigation it also became clear that the focus has to be extended, not just to cover Google and search engines in an isolated fashion, but also to cover other Web 2.0 related phenomena, particularly Wikipedia, blogs, and other related community efforts.

It was the purpose of our investigation to demonstrate:
- Plagiarism and IPR violation are serious concerns in academia and in the commercial world
- Current techniques to fight both are rudimentary, yet could be improved by a concentrated initiative
- One reason why the fight is difficult is the dominance of Google as THE major search engine and that Google is unwilling to cooperate
- The monopolistic behaviour of Google is also threatening how we see the world, how we as individuals are seen (complete loss of privacy) and is threatening even world economy (!)
In our proposal we presented a list of typical sections that would be covered at varying depth, with the possible replacement of one or the other by items that would emerge as still more important. The preliminary intended and approved list was:

Section 1: To concentrate on Google as virtual monopoly, and Google's reported support of Wikipedia. To find experimental evidence of this support or show that the reports are not more than rumours.
Section 2: To address the copy-paste syndrome and the socio-cultural consequences associated with it.
Section 3: To deal with plagiarism and IPR violations as two intertwined topics: how they affect various players (teachers and pupils in school; academia; corporations; governmental studies, etc.). To establish that not enough is done concerning these issues, partially due to just plain ignorance. We will propose some ways to alleviate the problem.
Section 4: To discuss the usual tools to fight plagiarism and their shortcomings.
Section 5: To propose ways to overcome most of the above problems according to proposals by Maurer/Zaka. To give examples, but to make it clear that to do this more seriously a pilot project is necessary beyond this particular study.
Section 6: To briefly analyze various views of plagiarism as it is quite different in different fields (journalism, engineering, architecture, painting, ...) and to present a concept that avoids plagiarism from the very beginning.
Section 7: To point out the many other dangers of Google or Google-like undertakings: opportunistic ranking, analysis of data as window into commercial future.
Section 8: To outline the need for new international laws.
Section 9: To mention the feeble European attempts to fight Google, despite Google's growing power.
Section 10: To argue that there is no way to catch up with Google in a frontal attack.
Section 11: To argue that fighting large search engines and plagiarism slice-by-slice by using dedicated servers combined by one hub could eventually decrease the importance of other global search engines.
Section 12: To argue that global search engines are an area that cannot be left to the free market, but require some government control or at least non-profit institutions. We will mention other areas where similar if not as glaring phenomena are visible.
Section 13: We will mention in passing the potential role of virtual worlds, such as the currently over-hyped system "Second Life".
Section 14: To elaborate and try out a model for knowledge workers that does not require special search engines, with a description of a simple demonstrator.
Section 15 (Not originally part of the proposal): To propose concrete actions and to describe an Austrian effort that could, with moderate support, minimize the role of Google for Austria.
Section 16: References (Not originally part of the proposal)

In what follows, we will stick to Sections 1-14 plus the new Sections 15 and 16 as listed, plus a few Appendices.

We believe that the relative importance of these topics has shifted considerably since the approval of the project. We thus will emphasize some aspects much more than ever planned, and treat others in a shorter fashion. We believe and hope that this is also seen as an unexpected benefit by BMVIT.

This report is structured as follows: After an Executive Summary that highlights why the topic is of such paramount importance, we explain in an introduction how best to study the report and its appendices. We can report with some pride that many of the ideas have been accepted by the international scene at conferences and by journals as being of such crucial importance that a number of papers (constituting the appendices and elaborating the various sections) have been considered high quality material for publication.
We want to thank the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT) for making this study possible. We would be delighted if the study can be distributed widely to European decision makers, as some of the issues involved do indeed involve all of Europe, if not the world.

Executive Summary: The Power of Google and other Search engines

For everyone looking at the situation it must be clear that Google has amassed power in an unprecedented way that is endangering our society. Here is a brief summary:

Google as search engine is dominating (convincing evidence on this is easily available and presented in Section 1). That on its own is dangerous, but could possibly be accepted as "there is no real way out", although this is not true, either. (We would rather see a number of big search engines run by some official non-profit organisations than a single one run by a private, profit driven company.) However, in conjunction with the fact that Google is operating many other services, and probably silently cooperating with still further players, this is unacceptable.

The reasons are basically:
- Google is massively invading privacy. It knows more about people, companies and organisations than any institution in history before, and is not restricted by national data protection laws.
- Thus, Google has turned into the largest and most powerful detective agency the world has ever known. I do not contend that Google has started to use this potential, but as a commercial company it is FORCED to use this potential in the future, if it promises big revenue. If government x or company y requests support from Google for information on whatever for a large sum, Google will have to comply or else violate its responsibilities towards its stockholders.
- Google is influencing economy by the way advertisements are ranked right now: the more a company pays, the more often will the ad be visible.
Google's answers to queries are also already ranked when searches are conducted (we give strong evidence for this in Section 1). Indeed we believe it cannot avoid ranking companies higher in the future who pay for such improved ranking: Google is responsible to stockholders to increase the company's value. Google is of course doing this already for ads.
- Since most material that is written today is based on Google and Wikipedia, if those two do not reflect reality, the picture we are getting through "googling reality", as Stefan Weber calls it, is not reality, but the Google-Wikipedia version of reality. There are strong indications that Google and Wikipedia cooperate: some sample statistics show that randomly selected entries in Wikipedia are consistently rated higher in Google than in other search engines.
- That biased contributions can be slipped into Wikipedia if enough money is invested is well established.
- Google can use its almost universal knowledge of what is happening in the world to play the stock market without risk: in certain areas Google KNOWS what will happen, and does not have to rely on educated guesses as other players in the stock market have to. This is endangering trading on markets: by game theory, trading is based on the fact that nobody has complete information (i.e. everybody will win sometimes, but also lose sometimes). Any entity that never loses rattles the basic foundations of stock exchanges!
- It has to be recognized that Google is not an isolated phenomenon: no society can leave certain basic services (elementary schooling, basic traffic infrastructure, rules on admission of medication, ...) to the free market.
It has to be recognized that the Internet and the WWW also need such regulations, and if international regulations that are strong enough cannot be passed, then as the only saving step an antitrust suit against Google has to be initiated, splitting the giant into still large companies, each able to survive, but with strict "walls" between them.
- It has to be recognized that Google is very secretive about how it ranks, how it collects data and what other plans it has. It is clear from actions in the past (as will be discussed in this report) that Google could dominate the plagiarism detection and IPR violation detection market, but chooses not to do so. It is clear that it has strong commercial reasons to act as it does.
- Google's open aim is to "know everything there is to know on Earth". It cannot be tolerated that a private company has that much power: it can extort, control, and dominate the world at will.

I thus call for immediate action, and some possibilities are spelt out in this report.

One word of warning is appropriate: This report originated from a deep concern about plagiarism using Google, about the Google Copy Paste syndrome, as one of the authors has called it. Consequently, this aspect is covered more deeply than some others in the main body of the paper, while some economic and political issues are moved into independent appendices.

If the danger of Google is your main concern, skip now to Sections 7 and 8, otherwise read the introduction which basically explains how this voluminous document should be read depending on your preferences.

Hermann Maurer, Graz/Austria, September 30, 2007
hmaurer@iicm.edu
www.iicm.edu/maurer

Introduction: A guide how to read this document

As has hopefully become clear in the executive summary, this report started with an important but still limited topic: fighting plagiarism and IPR violations.
It led rapidly into the study of threats posed by new methods of data-mining, employed by many organisations due to the fact that no international laws regulate such activities. The archetypical company that has turned this approach into a big success for itself but a veritable threat for mankind is Google. As we discuss how to possibly minimize the power of Google and similar activities we implicitly show that there is also room for European endeavours that would strengthen the role of Europe and its economy.

The leader of this project team, H. Maurer from Graz University of Technology, was thus forced to put together a team consisting of both researchers and implementers (who would try out some ideas) in "no time" and to compile a comprehensive report on the current situation and possible remedies in much less than a year, although the undertaking at issue would have needed much more time, resources and manpower. Thus we are proud to present a thorough evaluation including concrete recommendations in this report and its appendices, yet we realize that the necessary parallelism of work has created some redundancies. However, we feel that these redundancies are actually helpful in the sense that this way the 15 Sections and the six Appendices we present can be read fairly independently.

To give some guidance, we recommend reading the report in different ways, depending on what your main interests are, and we suggest some as follows:

If you are mainly interested in how Google, and surprisingly Wikipedia (two organisations that work much closer together than they are willing to admit) are changing the way we do research and learn (and hence influence the fabric of society), and that somehow this trend should be reversed, then Sections 1-5 are the ones you should definitely start with.
If you are more interested in the concept of plagiarism and how it applies to IPR violations, we recommend starting with Appendix 1, followed by a more specific analysis in Appendix 2, and a totally different and new approach that we are still going to pursue vigorously in Section 6.

Those more concerned about the influence that Google (and data mining) has on privacy but also other issues may want to start with Appendices 4, 5 and 6.

Readers who want to learn why we consider Google a serious threat to economy, much beyond invasion of privacy, elimination of intermediaries, threatening traditional ways of advertising, etc. should start with Section 7, and have an extended look at the situation in Section 8.

We feel it is not sufficient to recognize the danger that Google poses, but we also need alternatives. We regret that Europe is not even taking up the challenge the search engine Google poses, let alone all the other threats, as we discuss in Section 9. We do propose specific measures how to minimize the impact of Google as search engine (and how to by-pass Google's unwillingness to help with plagiarism detection) in Sections 10 and 11.

We explain in Section 12 why data-mining is an area of vital interest to all of humanity, like the provision of water, elementary schooling, etc., and hence should be recognized as such by governments on all levels.

We take a quick look at other new developments on the Internet (particularly Second Life) and try to make a guess how they might correlate with other aspects of data-mining in Section 13.

We show in Section 14 that the tools we have proposed (e.g. the collaborative/syndicated tools of Section 10) can also be used to make knowledge workers more efficient, an important aspect in the fight for economic prosperity in Europe.

In Section 15 we briefly discuss one approach that we are currently testing.
If carried out with some support by government and sponsors on a European level as proposed in Section 11 it would not only reduce the power of Google as search engine dramatically, but create new jobs and economic value.

It remains to say that Appendix 3 has been put in as a kind of afterthought, to present some issues that we did not cover elsewhere, particularly the relation between plagiarism detection and IPR violation.

It is worth mentioning that we have collected much more data in the form of informal interviews or e-mails than we dare to present since it would not stand up in court as acceptable evidence, if we are sued by a large data-mining company. However, this evidence that we cannot present here has increased our concern that Google is well on the way of trying to dominate the world, as one of the IT journalists of Austria (Reischl) has very courageously put it.

We hope that this collection of probes, tests and research papers that we have compiled under enormous time pressure due to what we see as very real danger will help to move some decision makers to take steps that reduce some of the dangers that we do point out in this report, and help to create new job opportunities for Europe.

Section 1: Data Knowledge in the Google Galaxy - and Empirical Evidence of the Google-Wikipedia Connection

(Note: Sections 1-5 are basically material produced by S. Weber after discussions with H. Maurer, with H. Maurer doing the final editing)

In the beginning of the third millennium, we are faced with the historically unique situation that a privately owned American corporation determines the way we search and find information - on a global scale. In former times, the self-organisation of storage and selection of our common knowledge base was primarily the task of the scientific system - especially of the librarians of the Gutenberg Galaxy. In the last 15 years, this has changed dramatically: We are moving with enormous speed from the Gutenberg to the Google Galaxy.
Fuelled by techno enthusiasm, nearly all social forces - political, scientific, artistic, and economic ones - have contributed to the current situation. Just think of the affirmative optimism we all felt in the nineties: With Altavista, we started to search for key terms of our interest - and the results were really astonishing: The notion of "networking" quickly gained a completely new dimension. All of a sudden, it was possible to collect information we otherwise and before had no means to gather.

An example: In 1996, one of the authors of this report was deeply into the autopoietic systems theory of German sociologist Niklas Luhmann. Typing the term "systems theory" into Altavista did not only lead to several research groups directly dealing with Niklas Luhmann's autopoietic theory at that time, but also to the work of Berlin sociologist Rodrigo Jokisch who developed his own sociological "theory of distinctions" as a critique as well as an extension of Luhmann's theory. In a web commentary found with Altavista, the author read for the first time about Jokisch's book "Logik der Distinktionen" which was to be released. For the first time the net (especially the search engine) made a connection on a syntactic level that made sense in the semantic and pragmatic dimension as well: Data turned into knowledge.

Seen from today, it is easy to reconstruct this situation and also to re-interpret why we all were fascinated by this new technology. One of the first scientific metaphors used for search engines from that period was the so-called "meta medium". Search engines were in the beginning of theory-building described as "meta media" [Winkler 1997] and compared for example to magazines covering the TV programme: They were seen as media dealing with other media (or media content), interpreted as media "above" media, as second order media (and remember in this context sites like thebighub.com: a meta search engine, a meta medium for many other meta media).
The self-organisation of the (economic and technological) forces of the web has led to a very ambivalent situation today: When Google emerged in 1998, Altavista step-by-step lost its position as the leading search engine defining the standards of what we seek and find, and Google increasingly became the new number one. Nonetheless media science still describes the web search situation with old metaphors from the print era: Instead of speaking of a (neutral!) "meta medium", today the metaphor of Google as the new "gate keeper" of the web is widespread [for example Machill & Beiler 2007]. Note that the "gate keeper" is always something dialectical: It enables (some information to come in) and it prevents (other information from reaching the user). For more on search engines see the excellent book [Witten 2006], and on the history of Google see [Batelle 2005] and [Vise 2005].

To demonstrate the over-all importance of Google, just have a look at some data:

Figure 1: Google clearly leading the ranking of US searches in 2005 [Machill & Beiler 2007, 107]

In this figure, you clearly see the dominating role of Google amongst the searches in the US in November 2005 (according to Nielsen/NetRatings): Nearly every second search in the US was done with Google.
If one compares autumn 2005 data with spring 2007, one can see that Google has gained once more (from 46.3% to about 55%) and the other search engines lost:

Figure 2: More than every second online search is done with Google [http://www.marketingcharts.com/interactive/share-of-online-searches-by-engine-mar-2007-294]

The world-wide success story of Google is unique in the history of media technologies: Not only had Google in fact developed the best search algorithm (the so-called "PageRank algorithm", named after its inventor Larry Page - and not after the page in the sense of a site), there also occurred a strange socio-cultural process which is known as the "Matthew effect" or labelled as "memetic" spreading: As more and more people started to use the new search engine, even more people started to use it. The Google revolution was (a) emergent and (b) contingent (in the sense that nobody could forecast it in the mid-nineties). In the eye of the users, it seemed to be an evolutionary development, but as seen in the context of the historical evolution of searching, archiving and finding information, we were witnesses of an incomparable revolution. And the revolution still goes on.

What has happened? Sergey Brin, one of the two founders of Google, had the idea that information on the web could be ordered in a hierarchy by the so-called "link popularity" (this means: the more often a link directs to a specific page, the higher this page is ranked within the search engine results). Other factors like the size of a page, the number of changes and its up-to-dateness, the key texts in headlines and the words of hyperlinked anchor texts were integrated into the algorithm. Step-by-step, Google improved an information hierarchisation process which most net users trust blindly today. The result was perplexing: "If Google doesn't find it, it doesn't exist at all" quickly became the hidden presupposition in our brains.
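The "link popularity" idea described above can be illustrated with a small sketch. The first function simply counts inbound links per page, which is the literal reading of "the more often a link directs to a specific page, the higher this page is ranked". The second is the simplified textbook form of PageRank, in which a link counts for more when the linking page is itself highly ranked. Both are illustrations under stated assumptions and not Google's actual ranking: the page names, the damping value of 0.85 and the number of iterations are hypothetical choices made for this example, and the other factors mentioned in the text (page size, freshness, anchor texts) are left out.

```python
from collections import Counter


def link_popularity(links):
    """Rank pages by their raw number of inbound links.

    `links` is a list of (source_page, target_page) pairs.
    Self-links are ignored. Pages without inbound links do not appear.
    """
    inbound = Counter(target for source, target in links if source != target)
    # Highest inbound count first; ties are broken alphabetically.
    return sorted(inbound.items(), key=lambda item: (-item[1], item[0]))


def pagerank(links, damping=0.85, iterations=50):
    """Simplified textbook PageRank computed by power iteration.

    Each page passes its current rank on to the pages it links to,
    so links from highly ranked pages weigh more than other links.
    """
    pages = sorted({page for pair in links for page in pair})
    outgoing = {p: [t for s, t in links if s == p] for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for p in pages:
            # A page without outgoing links spreads its rank over all pages.
            targets = outgoing[p] or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web of four pages (illustrative only, not real data).
toy_links = [
    ("a.example", "c.example"),
    ("b.example", "c.example"),
    ("d.example", "c.example"),
    ("a.example", "b.example"),
    ("c.example", "b.example"),
    ("c.example", "a.example"),
]

popularity = link_popularity(toy_links)
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web both measures put c.example first, because three of the other pages link to it. The two measures can diverge on larger graphs, which is one reason why a pure link count is easier to manipulate than a rank that depends on who is linking.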
The importance of Google increased further and further as not only the search algorithms constantly evolved, but also as the socio-cultural "Matthew effect" remained unbroken.

We propose an experimental situation in which a group of people is asked to search for a term on the net. We let the test persons start with an innocuous site (e.g. www.orf.at in Austria, the web site of the nationwide broadcasting corporation). Our hypothesis is that about 80 or even 90 percent of the people will start their search with Google (in an experimental setting, perhaps even habitual users of Yahoo or Microsoft Live will tend to use Google). There is much evidence for this virtual monopoly of Google. Just look at the following table:

Table 1: Percentage of US searches among leading search engine providers - Google rated even higher [http://www.hitwise.com/press-center/hitwiseHS2004/search-engines-april-2007.php]

One can clearly see that the success story of Google did not stop within the last twelve months. On the contrary, from April 2006 to April 2007 the percentage of Google searches amongst all US searches rose from 58.6 to 65.3 (which means that Google ranks even higher according to Hitwise than according to Nielsen/NetRatings). Yahoo and especially MSN did lose (at the moment it is questionable whether Microsoft's "new" search engine Live can stop this development in the long run). Please note that the top-three search engines make up 94.45 percent of the whole "search-cake". This number shows dramatically what a new search engine would have to do if it wants to get a piece of that cake. To get users to leave Google and begin to trust another search engine would need an amount of money as well as technological competence and marketing measures beyond all thought.
And not to forget the "butterfly effect" in combination with the "Matthew effect" in the socio-cultural context: People must start to use the new search engine because other people whom they trust already do so (two-step-flow or second-order effect). Let us call the new search engine wooble.com. Then opinion leaders worldwide have to spread: "Have you already used Wooble? It shows you better results than Google." At the moment, this process is completely unrealistic.

The overwhelming virtual monopoly of Google has already led to the equation "Google = Internet". Will Google also develop a web browser and thus swallow Microsoft's Internet Explorer (as the Explorer swallowed Netscape Navigator before)? Will Google develop more and more office software tools (first attempts can be tried out in the "Google Labs" zone, for example the "Text&Tabellen" programme or the by now quite wide-spread Gmail free mail software)? Will Google turn from the virtual monopoly on the web into the Internet itself? In a blog one could read recently that Google is not really a competitor anymore, but already the environment [cited after Kulathuramaiyer & Balke 2006, 1737]. Will we one day be faced with the situation that our new all-inclusive medium (as prime address in all concerns worldwide), the Internet, and a privately owned corporation with headquarters near San Francisco will merge, will form a unity?

Figure 3: Vision of Google in 2084 [From "New York Times", 10 October 2005, http://www.nytimes.com/imagepages/2005/10/10/opinion/1010opart.html]

Hyper-Google = the net itself will then know everything about us: Not only the way we currently feel, the things we bought (and would like to buy) and searched for, but also our future (see Figure 3).
In fact this vision is not very comfortable, and this is the main reason why strategies are discussed to stop Google, mainly through legal limitations or other restrictions from the government or the scientific system [Machill & Beiler 2007, see also Maurer 2007a and 2007c].

As shown in the figures above, at the moment at least about two thirds of all US web inquiries are executed via Google. If you look at a specific European country, e.g. Austria, the situation is not different at all.

Figure 4: Users' search engine preferences in Austria 2006 compared to 2005 [Austrian Internet Monitor, http://www.integral.co.at, thanks for the slide to Sandra Cerny]

94 percent of the Austrian Internet users said that they used Google at least once in the last three months. The number has increased from 91 to 94 percent between March 2005 and March 2006. So if users are asked (instead of examining actual searches amongst users), the dominance of Google gets even clearer: At the moment there seems to be no way to compete with Google any more - although we know that web users can globally change their behaviour and preferences very quickly in an indeterminable way ("swarm behaviour").

Finally, to present empirical evidence for the dominance of Google, we can not only look at actual searches or for example at the users of a specific nation, but we can also have a look at a specific group of people, for example at journalists, scientists or students: Surprisingly enough, we have nearly no hard facts about the googling behaviour of pupils and students. We know that pupils more and more tend to download ready-made texts from the net, for example from Wikipedia or from paper mills. We have no large-scale statistics on the Google Copy Paste behaviour of students as described in some alarming case studies by Stefan Weber [Weber 2007a].

But we do have some empirical evidence on the googling behaviour of another group of text-producing people: of journalists.
More and more, initiatives to maintain journalistic quality standards complain that journalistic stories, too, are increasingly the result of a mere "googlisation of reality". One drastic example is described by the German journalist Jochen Wegner [Wegner 2005]: A colleague of his did a longer report on a small village in the north of Germany. He reported about a good restaurant with traditional cooking, a region-typical choir doing a rehearsal in the village church and about a friendly farmer selling fresh agricultural products. If you type the name of the small village into Google, the first three hits are the web sites of the restaurant, the farmer and the choir. And if you compare the complete story of the journalist with the texts on the web sites found by Google, you will see: As a journalist of the 21st century, you don't have to be at a place to write a "personal" story about it. Google can do this for you.

Two recent empirical studies tried to shed light on the googling behaviour of journalists in German-speaking countries: the Swiss study "Journalisten im Internet 2005" and the Austrian study "So arbeiten Österreichs Journalisten für Zeitungen und Zeitschriften" in the year 2006. In both studies, the overwhelming dominance of Google was evident: Between 2002 and 2005, the number of Swiss journalists who do primarily Google research grew from 78.5 to 97.1 (!) percent. Therefore one can say that nearly every Swiss journalist at least also used Google in 2005 - and this won't have changed much until today.

Figure 5: Search engine preferences of Swiss journalists 2005 compared to 2002 [Keel & Bernet 2005, 11]

Also, in Austria 94.8 percent of the print journalists asked in a survey admitted that they start their research for a story at least sometimes with the googling of keywords, names, or images. 60 percent of the journalists google regularly.
Figure 6: Googling as new starting point of research for journalists in Austria 2005

Survey question: "How often does your research begin with the following actions?" (answers in percent)

Action | regularly | sometimes | never
Reaching for the telephone | 69.6 | 28.7 | 1.7
Googling terms, names or images | 60.2 | 34.6 | 5.2
Querying the APA archive | 36.4 | 33.5 | 30.1
Querying the in-house digital media archive | 33.3 | 42.8 | 23.9
Reading (non-fiction) books | 21.3 | 58.5 | 20.1
Searching the in-house paper archive | 12.5 | 49.9 | 37.6

Method: online survey; survey period: 16.11.2005 to 5.12.2005; n=296 print journalists [Weber 2006a, 16]

All the data published here could be reported in much the same way almost world-wide, with a few exceptions. Google, for example, is not strong in South Korea, where a publishing group with "Ohmynews" dominates the scene. In any case, without exaggeration we have to state that Google rules the world of information and knowledge. If the world has turned into an information society or even into a knowledge society (according to philosopher Konrad Paul Liessmann the notion of a "knowledge society" is only a myth), then googling is the new primary cultural technique for gathering information (or better: for gathering data from the web). The Google interface and the (hidden) way Google ranks the information found are at the core of the information society at the moment. Never before in history was this organised by a private enterprise.

In the current academic world we observe not only the so-called "Google Copy Paste Syndrome" (GCPS) as a new and questionable cultural technique of students [Weber 2007a, Kulathuramaiyer & Maurer 2007], but also an interesting by-product of the GCP technique: the obvious Google-Wikipedia connection (GWC).
In terms of net user behaviour, this means that when typing a specific technical term into Google, one will tend to click on the Wikipedia link because one will probably trust Wikipedia more than other sources: because of the collective generation of knowledge in Wikipedia, which leads to a specific form of consensus theory of truth, one might feel that information in this net encyclopaedia is less biased than information from various other sites. This notion is vigorously defended in [Surowiecki 2004] but equally well attacked in the "must-read" book [Keen 2007]. In this context we also recommend an experimental setting in which students are told to search for a list of technical terms on the net and copy & paste the definitions obtained - the hypothesis is that a great majority will consult Wikipedia and copy & paste definitions from that site.

Our everyday experience with googling technical terms led us to the intuitive assumption that Wikipedia entries are ranked among the top three, or even in first place, of the Google search results significantly more often than other links. One reason could be the Google search algorithm, which probably ranks sites higher if they are big and continuously updated. In a Google watchblog a blogger recently wrote that the "new" Google search algorithm ranks sites higher the more they are linked not only externally but also internally, and the more often the site has changed (see http://www.jmboard.com/gw/2007/04/28/neues-patent-zum-zeitlichen-ranking-von-webseiten). Of course this would be a good explanation of why Wikipedia entries are very often ranked very high in the Google results list. But the ranking of Wikipedia entries could also have another, more unpleasant reason: It is possible that Google intentionally ranks some sites higher than others.
This would mean that information is put into a hierarchical order not only by an "abstract" mathematical algorithm after indexing the text galaxy of the net, but also by human forces tagging some sites with higher points than others (which will lead to a higher ranking, again by an algorithm). This is not mere speculation. Wikipedia itself reported in 2005 a strange fact about a cooperation between the net encyclopaedia and Yahoo:

"An agreement with Yahoo in late Spring 2004 added them as one of the search engines widgets offered for web/Wikipedia searching when our internal search goes down, and we also provided them with a feed of our articles (something we needed to do anyway). In exchange, we were apparently linked more prominently, or more effectively, in their searches, and have access to stats on how many click-throughs we get from them. It's not clear how we were linked more prominently, but click-throughs increased by a factor of 3 in the months after this agreement (and then levelled off)." [http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts; boldface introduced by the authors of this report]

So Wikipedia admits that after an agreement with Yahoo they observed that Wikipedia results began to climb up the list of search matches. The question remains: Was this due to a coincidental change or improvement of the Yahoo search algorithm, or due to some kind of "human intervention"? This case draws attention to a blind spot of large search engines: Is everything done by computer programmes, or is there some kind of intentional intervention in the search results? Remember also that the Chinese version of the Google web site has shown that Google is able and willing to bias the information one retrieves.
A German scientist also wrote about a cooperation between Yahoo and Wikipedia as well as between Google and Wikipedia:

"Some months ago the 'Wikimedia Foundation' has signed an agreement with 'Yahoo's' rival 'Google' that guarantees that - as well as with 'Yahoo' - a matching Wikipedia article is ranked all above on the list of search results." [Lorenz 2006, 86 f.; translation by the authors of this report]

The same scientist also speculated: "How does Wikipedia finance that? Or does Google donate that service?" [Lorenz 2006, 87, footnote 14; translation by the authors of this report] In personal e-mail correspondence, the author informed us that the spokesperson of Wikipedia Germany strictly denied the existence of such an agreement.

The question of whether search results are intentionally biased is central to the credibility and reliability of large search engines. As shown above, in the current information or knowledge society search engines - and especially Google - determine what we search and find. They are not only "meta media", they are not only "gate keepers" of information. In fact they are much more: Google in particular has become the main interface to our whole reality. This ranges from the search for technical terms to news searches on a specific topic. If we speak about the interface to our whole reality, we mean that in an epistemological sense. Of course many documents on the web (especially in the hidden "deep web", in sites which can only be accessed with passwords, in paid content sites etc.) won't be found by Google. And many documents (not only older ones!) are still only available offline. To be precise: With the Google interface the user gets the impression that the search results imply a kind of totality. In fact, one only sees a small part of what one could see if one also integrated other research tools. (Just type the word "media reception" into Google.
The first results won't give you any impression of the status quo of this research field. In fact, especially with scientific technical terms one often gets the impression that the ranking of the results is completely arbitrary.)

For this report, we did the following experiment: We randomly chose 100 German and 100 English keywords from the A-Z index of the two Wikipedia versions and typed these keywords into four large search engines. We noted the position of the Wikipedia link in each ranking list. This experiment shows for the first time how our knowledge is organised in rather different ways by different research tools on the net.

Detailed description of the experiment: We typed 100 randomly chosen keywords with already existing text contributions (and a few forward-only keywords) from http://de.wikipedia.org/wiki/Hauptseite into:
- http://www.google.de
- http://de.yahoo.com
- http://de.altavista.com
- http://www.live.com/?mkt=de-de
and noted the ranking.

We decided to use these four search engines because A9.com does not work with its own search engine, but shows the web search results of live.com, the product search results of amazon.com itself and additionally results of answers.com and others. AOL search is likewise powered by Google - so we would have compared Google with Google or Microsoft Live with Microsoft Live.

To replicate the experiment in English, we randomly took 100 keywords from the English Wikipedia site http://en.wikipedia.org/wiki/Main_Page (also from the A-Z index) and typed them into:
- http://www.google.com
- http://www.yahoo.com
- http://www.altavista.com
- http://www.live.com
and again noted the ranking.
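The tallying step of such an experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_ranks`, the input layout and the sample observations are hypothetical and not part of the original study, and a keyword whose Wikipedia link was not found at all is counted here as "beyond the top ten".

```python
from collections import defaultdict


def summarize_ranks(observations):
    """Summarise Wikipedia rank positions per search engine.

    observations: iterable of (engine, rank) pairs, where rank is the
    1-based position of the Wikipedia link, or None if it was not found
    (assumption: "not found" counts as beyond the top ten).
    Returns percentages for first place, top three and beyond top ten.
    """
    per_engine = defaultdict(list)
    for engine, rank in observations:
        per_engine[engine].append(rank)

    summary = {}
    for engine, ranks in per_engine.items():
        n = len(ranks)
        first = sum(1 for r in ranks if r == 1)
        top_three = sum(1 for r in ranks if r is not None and r <= 3)
        beyond_ten = sum(1 for r in ranks if r is None or r > 10)
        summary[engine] = {
            "first": 100.0 * first / n,
            "top_three": 100.0 * top_three / n,
            "beyond_top_ten": 100.0 * beyond_ten / n,
        }
    return summary


# Hypothetical observations for illustration only (not data from the report).
sample = [
    ("engine_a", 1), ("engine_a", 2), ("engine_a", None), ("engine_a", 15),
    ("engine_b", 4), ("engine_b", 1),
]
result = summarize_ranks(sample)
```

Keeping the raw rank per keyword, rather than only the three summary percentages, would also allow later re-analysis, for example with a different cut-off than the top ten.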
And here are the results based on 100 randomly chosen keywords:

Figure 7: Percentage of Wikipedia entries ranked as first results [bar chart by search engine used (Google, Yahoo, Altavista, Live), German sites vs. English sites]

Figure 8: Percentage of Wikipedia entries within the first three results [bar chart by search engine used (Google, Yahoo, Altavista, Live), German sites vs. English sites]

Figure 9: Percentage of Wikipedia entries beyond top ten results [bar chart by search engine used (Google, Yahoo, Altavista, Live), German sites vs. English sites]

The results show two clear tendencies:
1) Google clearly privileges Wikipedia sites in its ranking - followed by Yahoo. Microsoft's search engine Live only rarely ranks Wikipedia entries very high.
2) The German Google version privileges the German Wikipedia site significantly more than the international Google site privileges the English Wikipedia site.

Of course we can exclude the assumption that the German and the international Google sites operate with completely different algorithms. So we can draw two possible conclusions from the astonishing result mentioned under 2):
1) Context factors influence the ranking: One explanation would be that the German Wikipedia is much more internally linked than the English mother site (for which we have no evidence). Another (weak) explanation could be that the German Wikipedia is more popular in German-speaking countries than the English site is in English-speaking countries (which is true) and that this indirectly leads to a better ranking because more links lead to the German Wikipedia (which is questionable).
2) The other conclusion is scary: Google does in a strange and unknown way privilege Wikipedia entries - followed by Yahoo; and Google does this intentionally more with the German version.
We recommend investigating these apparent correlations or strange coincidences in more depth. For this report, we only did a pilot. Comparative research on search engine result rankings would in fact be a new and crucial field of research, which at the moment is still largely missing.

The apparent Google-Wikipedia connection (GWC) is also problematic from an epistemological point of view: When people google key terms, they need no mental effort to do research: everybody can type a word or a phrase into a search engine (in former times, one needed basic knowledge about the organisation of a library and the way a keyword catalogue operates, and one needed to work with the so-called "snowball system" to find new relevant literature in the reference lists of literature already found). So there is a clear shift in the field of research towards research without brains. But there is also another shift in the way encyclopaedic knowledge is used: In former times facts from a print encyclopaedia marked at most the starting point of a piece of research (who ever cited the Encyclopaedia Britannica or the Brockhaus verbatim?). Today one observes countless students copying passages from Wikipedia. Thus a term paper can be produced within a few minutes. Students lose the key abilities of searching, finding, reading, interpreting, writing and presenting a scientific paper with their own ideas and arguments, developed after a critical close reading of original texts. Instead they use Google, Copy & Paste and PowerPoint. Their brains are now contaminated by fragmented Google search terms and the bullet points of PowerPoint. For a critique of PowerPoint see also [Tufte 2006].

The online encyclopaedia Wikipedia is problematic not only because of vandalism or fabrication of data. It is also problematic because of the often unknown origin of the basic texts that are then adapted by the authors' collective of net users.
In some reported cases the very first version of a Wikipedia entry was already plagiarised, often copied nearly verbatim without the use of much "brain power" from an older print source. Some of these cases are precisely described in [Weber 2005b] and [Weber 2007a, 27 ff.]. We have to state that there is a systematic source problem in Wikipedia, because the problem of plagiarism was ignored for too long, and is still being ignored. For example, just type the word "Autologisierung" into the German Wikipedia site. You will find an entry which was published some time ago by an unknown person. But the entry is a rather brainless plagiarism of the subchapter "Autologisierung" in a scientific contribution by one of the authors of this report (Stefan Weber) which appeared in 1999 in a print anthology. Hardly anybody has used that word since then, and in fact there is absolutely no reason why it should be a keyword in the German Wikipedia :-). These are only some examples or case studies of an evolving text culture without brains on Wikipedia as well. We should also not overlook that the American Wikipedia and Google critic Daniel Brandt reported 142 cases of plagiarism on Wikipedia [N. N. 2006b].

The reliability of knowledge in Wikipedia remains a big problem for our common knowledge culture. Unreflective cut and paste from Wikipedia seems to be a new cultural technique which has to be observed with care. If Google privileges Wikipedia and thus makes the path to copy & paste even more direct, the problem is a double one: a search monopoly of a private corporation and the informational uncertainty of a collective encyclopaedia which also marks a new knowledge monopoly.

We will return to some of the above issues in Section 7. For references see Section 16.

Section 2: Google not only as main door to reality, but also to the Google Copy Paste Syndrome: A new cultural technique and its socio-cultural implications

(Note: Sections 1-5 are basically material produced by S.
Weber after discussions with H. Maurer, with H. Maurer doing the final editing)

As shown in Section 1, at the moment Google determines the way we search and find information on the net to a degree that media critics, scientists and politicians can no longer remain silent about: they should actively raise their voices. Section 1 dealt with the epistemological revolution in the net age: The ranking algorithm of a privately owned corporation listed on the stock exchange dictates which information we receive and which information is neglected or even intentionally suppressed. As elaborated in Section 1, this is a unique situation in the history of mankind. But there is also an important consequence affecting our whole knowledge production and reception system on a more socio-cultural level: After googling a technical or common term, a name, a phrase or whatever, especially the younger generation tends to operate with the text segments found in a categorically different way than the generation of the Gutenberg Galaxy did: While the print generation tended towards structuring a topic and writing a text by themselves, the new "Generation Google" or "Generation Wikipedia" rather works like "Google Jockeys" [N. N. 2006a] or "Text Jockeys" [Weber 2007f]: They approach text segments in a totally different manner than the print-socialised generation. Text segments found on the web are often appropriated despite their clearly claimed authorship or their clearly communicated copyright restrictions, because they are seen as "free" and/or "highly reliable". One often hears persons accused of net plagiarism justifying themselves: "But it's already written on the web - why should I put it in new words anyway?".
See the quite sobering example in [Weber 2007a, 4]. The new generation of text jockeys, starting with googling key terms or phrases, tends to cut and paste information found in the search engine's result list directly into their document and to claim authorship of it from then on. Of course plagiarism also occurred in the print era (as will be shown later on), but the Google Copy Paste technique is something categorically new in the history of the relationship between text and author.

We will give the following clear and revealing example as an introduction: Type the German word "Medienrezeption" (media reception) into Google. You will see that the third entry of the search results list is a Swiss "Lizentiatsarbeit" (roughly equivalent to a master thesis) about media reception in the context of school kids. Absurdly enough, this site has nearly nothing to do with the current status quo of media reception research. We have no idea how many external links must point to this completely arbitrary result for it to have risen to rank 3 on the Google list (we checked it for the last time on 20 May 2007; but the link had already been among the top three Google results months - and probably years - before).

Figure 10: Searching "Medienrezeption" with Google [Screenshot, 17 May 2007]

If one clicks on the link, one will find the following text segment (which was - as we suppose - written by the author of the Swiss master thesis and not copied from elsewhere):

Figure 11: Document found on third place of the Google results [http://visor.unibe.ch/~agnet/kaptheoriekh.htm, visited 29/5/07]

The marked text segment went straight into the diploma thesis of an Austrian student - of course without any reference to the original text or context found on the web.

Figure 12: This document used for plagiarism [Scan from N. N., Wickie und die starken Männer - TV-Kult mit Subtext. Diploma thesis, 2004, p. 15 f.]
Please note: This text segment would contain "quote plagiarism" even if the original text from the net had been quoted "properly". In the humanities you are only allowed to reproduce a quote within another quote when you are unable to obtain the original. With the Neumann/Charlton quote this would not be the case.

It can be excluded that the plagiarism went the other way round or that both documents have a common third and unknown origin. The Swiss master thesis ranked by Google in third position dates from 1998 (see http://visor.unibe.ch/~agnet/Liz%20Kathrin.pdf), the Austrian master thesis was approved in 2004. Also see [Weber 2007a, 74 ff] for a discussion of the example. And it turned out that the Austrian author of the diploma thesis - at that time a scientific assistant at the department of media and communication research at Alpen Adria University Klagenfurt - plagiarised about 40 percent of her whole text (for an online documentation see http://www.wickieplagiat.ja-nee.de). The author of the plagiarised master thesis was dismissed from her job at the university in August 2006.

Sadly enough this was no isolated case of cut and paste plagiarism breaking with all known and well-established rules of academic honesty and integrity. One of the authors of this study, Stefan Weber, has meanwhile collected 48 cases of plagiarism which occurred mainly at Austrian (and some German) universities, on the web and in journalism between 2002 and 2007. The spectrum ranges from small term papers completely copied & pasted from one single web source (you need about ten minutes to do this - including layout work) to post-doctoral dissertations and encyclopaedias of renowned professors emeriti. The majority of the cases are connected with some kind of net plagiarism. Two further examples will be discussed in the following. An Austrian political scientist and expert on asymmetric warfare (!)
copied at least 100 pages - and probably many more - from the web into his doctoral thesis in 2002, as always without giving any credit. The "funny" thing is that he even appropriated nearly verbatim the summary and the conclusions of another paper written four years earlier. The original paper was published online as a PDF and can be found here: http://cms.isn.ch/public/docs/doc_292_290_de.pdf. This document was written by scientists at the technical university of Zurich in 1998 and has 106 pages. One can find the summary and the conclusions of this document (and nearly all following chapters) mainly verbatim in a Viennese doctoral thesis from the year 2002. Just compare the following two screenshots:

Figure 13: Original text as PDF file on the web [http://cms.isn.ch/public/docs/doc_292_290_de.pdf, p. 5, original text dating 1998, visited 29/5/07]

Figure 14: Plagiarised version of the same text [Screenshot from dissertation of N. N., text dating 2002]

Start comparing the two texts at "Regionalisation, the shifting of power" in the doctoral thesis and note that the plagiarising author made very small changes ("Regionalization" turned into "Regionalisation"; "Parallel to the decline" turned into "In line with the decline" and so on). Whenever one can identify a clearly copied document with very small changes (1 or 2 words per sentence replaced with synonyms or a different spelling), this is a strong indicator that the copying was intentional and not merely a "mistake of the computer" or a "software problem".

The third example of net-based copy & paste "writing" is the most extreme one known to us so far: The author - a former student of psychology, coincidentally again at Klagenfurt university - compiled the prose of her doctoral thesis from probably more than one hundred un-cited text fragments from various online sources (including non-scientific ones).
Hard to believe, the first 20 pages of the dissertation were not much more than the concatenation of text chunks found on the following web sites (and of course nothing was referenced):
- http://www.phil-fak.uni-duesseldorf.de/epsycho/perscomp.htm
- http://www.diplomarbeiten24.de/vorschau/10895.html
- http://www.bildungsserver.de/zeigen.html?seite=1049
- http://www.foepaed.net/rosenberger/comp-arb.pdf
- http://www.behinderung.org/definit.htm
- http://www.dieuniversitaet-online.at/dossiers/beitrag/news/behinderung-integration-universitat/83/neste/1.html
- http://www.behinderung.org/definit.htm
- http://info.uibk.ac.at/c/c6/bidok/texte/steingruber-recht.html
- http://info.uibk.ac.at/c/c6/bidok/texte/finding-sehbehindert.html
- http://www.arbeitundbehinderung.at/ge/content.asp?CID=10003%2C10035
- http://www.behinderung.org/definit.htm
- http://ec.europa.eu/employment_social/missoc/2003/012003/au_de.pdf
- http://www.grin.com/de/preview/23847.html
- http://www.bmsg.gv.at/cms/site/attachments/5/3/2/CH0055/CMS1057914735913/behindertenbericht310703b1.pdf
- http://www.parlinkom.gv.at/portal/page?_pageid=908,221627&_dad=portal&_schema=PORTAL
- http://www.bpb.de/publikationen/ASCNEC,0,0,Behindertenrecht_und_Behindertenpolitik_in_der_Europ%E4ischen_Union.html
- http://ec.europa.eu/employment_social/disability/eubar_res_de.pdf
- http://ec.europa.eu/public_opinion/archives/ebs/ebs_149_de.pdf

This case was also documented in [Weber 2006b]. After media coverage of this drastic example of plagiarism, the university of Klagenfurt decided to check all dissertations and master theses approved in the last five years for suspected plagiarism with the software Docol©c (see chapter 4). An explicit plagiarism warning was published on the web site of the university, including an extended definition of plagiarism (http://www.uni-klu.ac.at/main/inhalt/843.htm).

To find copied text chunks in the highly questionable dissertation, it is sufficient to start with the very first words of the preface.
The author no longer wrote in the traditional way; rather, nearly everything was copied from the web.

Figure 15: Original from the web... [http://www.phil-fak.uni-duesseldorf.de/epsycho/perscomp.htm, original text dating 1996, visited 29/5/07]

Figure 16: ... again used for plagiarism [Scan from dissertation of N. N., 2004, p. 7]

Of course we have to mention that Google is not only the problem (as the first part of the fatal triad Google, Copy, and Paste), but also one possible solution: All the cases of plagiarism documented above were also detected by googling phrases from the suspicious texts. One can reconstruct this detection easily by just googling a few words from the plagiarised text material (for example "Der Personal Computer wird mehr und mehr" - in most cases, a few words are absolutely sufficient!). But this does not mean that the problem is solved by Google - not at all! We always have to remember that the misuse came first, and only then do we have the possibility to reveal some plagiarised work, but by far not all of it. Just one example: Order a diploma thesis from http://www.diplom.de and pay about 80 Euros for the whole text burnt on a CD. If you use that text for plagiarism, Google won't help at all to detect the fraud.

Whenever the media cover big cases of plagiarism, journalists ask the few experts in the field: How widespread is this behaviour at universities at the moment? University administrators often speak of very few cases of problematic plagiarism. However, some statistics report that about 30 percent or more of students have (partly) plagiarised their work at least once. Other reports say that likewise about 30 percent of all submitted works contain plagiarised material [1]. However, we will discuss in Section 6 that plagiarism is seen very differently in different fields.
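The idea of checking a suspicious text by searching for short word sequences taken from it can be illustrated with a small sketch. It is not the procedure used by the authors or by any particular detection tool: the function names, the window length `k` and the sample sentences are assumptions chosen for illustration, and the comparison is done against a locally available candidate source instead of a search engine.

```python
import re


def word_shingles(text, k=6):
    """Return the set of all k-word sequences (lower-cased) in text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shared_phrase_ratio(suspect, source, k=6):
    """Fraction of the suspect's k-word phrases that also occur in source."""
    suspect_phrases = word_shingles(suspect, k)
    if not suspect_phrases:
        return 0.0
    return len(suspect_phrases & word_shingles(source, k)) / len(suspect_phrases)


# Hypothetical sample texts for illustration only.
source = "Today the personal computer is used more and more in schools and offices"
verbatim = "the personal computer is used more and more in schools"
altered = "the personal computer is applied more and more in schools"

ratio_verbatim = shared_phrase_ratio(verbatim, source, k=6)
ratio_altered_long = shared_phrase_ratio(altered, source, k=6)
ratio_altered_short = shared_phrase_ratio(altered, source, k=3)
```

In this toy example a single replaced word removes every shared six-word phrase, while three-word phrases still overlap substantially. Short windows are therefore more robust against small edits, at the price of more accidental matches on common expressions.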
Most of the careful examination done by one of the authors of this report concentrated on material whose intrinsic value is text-based, quite different from other areas in which the intrinsic value is not the wording, but the new idea, the new formula, the new drawing or the new computer program. Thus, plagiarism, and particularly the GCP syndrome, is more of a problem in text-only related areas, and less so in engineering disciplines, architecture, etc.

Cut and paste plagiarism became a topic for German universities and for the German media after the University of Bielefeld (D) reported that in 2001/2002, in a sociological seminar of Professor Wolfgang Krohn, about 25 percent of the submitted papers and about 25 percent of the participants were involved in some kind of plagiarism. No detailed studies on the percentage increase due to the internet are available to us, yet the papers quoted in the Introduction in Appendix 1 all mention a significant increase.

[1] "First indications from university professors at home and abroad suggest, however, that the production of plagiarised work with the help of the Internet shows a clearly rising tendency. At the University of California (Berkeley/USA), for example, an increase in cheating attempts of 744 percent was observed over a period of three years (reference year: 1997)." (Quote, translated from the German, from footnote (1) of: http://www.hochschulverband.de/presse/plagiate.pdf)

Figure 17: Percentage of plagiarism on a German university in 2001/02 (share of cases of plagiarism in blue) [http://www.uni-bielefeld.de/Benutzer/MitarbeiterInnen/Plagiate/iug2001.html, visited 20/5/07]

Amongst the many surveys on plagiarism conducted worldwide since then, two studies delivered highly reliable results: The survey of Donald L.
McCabe, executed for the "Center for Academic Integrity" (CAI) at Duke University in the USA between 2002 and 2005 (N>72,950 students and N>9,000 staff members), and a survey executed by OpinionpanelResearch for the "Times Higher Education Supplement" amongst students in Great Britain in March 2006 (N=1,022 students). Both studies revealed nearly the same finding: about one third of the students had already been involved in some kind of plagiarism. Have a look at the data in detail [for a summary also see Weber 2007a, 51 ff.]:

Table 2: Plagiarism in the US
Survey of Donald L. McCabe, US (N>72,950 students; N>9,000 staff members)

Cheating on written assignments | Undergraduates* | Graduate Students* | Faculty**
"Paraphrasing/copying few sentences from written source without footnoting it" | 38 % | 25 % | 80 %
"Paraphrasing/copying few sentences from Internet source without footnoting it" | 36 % | 24 % | 69 %
"Copying material almost word for word from a written source without citation" | 7 % | 4 % | 59 %

* Values represent % of students who have engaged in the behaviour at least once in the past year.
** Values represent % of faculty who have observed the behaviour in a course at least once in the last three years.
[Donald L. McCabe, "Cheating among college and university students: A North American perspective", http://www.ojs.unisa.edu.au/journals/index.php/IJEI/article/ViewFile/14/9, 2005, p. 6]

Table 3: Plagiarism in GB
Survey of Opinionpanel, GB (N=1,022 students)
Question: "Since starting university, which of the following have you ever done?"

Action | Share of students
"Copied ideas from a book on my subject" | 37 %
"Copied text word-for-word from a book on my subject (excluding quoting)" | 3 %
"Copied ideas from online information" | 35 %
"Copied text word-for-word from online information (excluding quoting)" | 3 %

[OpinionpanelResearch, "The Student Panel", paper from Times Higher Education Supplement, personal copy, received July 2006, p.
4]

Similar smaller studies were done in Germany and in Austria and led to similar results (for example an Austrian students' online platform did a small survey in June 2006 asking students "Did you ever use texts without citations?", and 31 percent answered "yes", see Weber 2007a, 55). A recent study carried out as a master thesis at the University of Leipzig (D) revealed that as many as 90 percent (!) of the students would in principle plagiarise if the occasion arose. The survey was conducted online using the randomized response technique (RRT), which makes it more likely that students answer highly sensitive questions truthfully.

In sum nearly all studies say the following:
1) There is a very high willingness to plagiarise among the current generation of students. The data collected so far are indicators of a growing culture of hypocrisy and cheating at universities.
2) About 3 to 7 percent of the students admit some kind of "hardcore plagiarism" even when asked in a scientific context (probably in contrast to what they would say to their peers; one author of this report was faced with more than one student who said that he or she was proud of having cheated).
3) About 30 percent admit some kind of "sloppy referencing" or problematic paraphrasing. In many of these cases, we do not know whether we should talk of plagiarism or of a single "forgotten" footnote - which in fact must be decided case by case.

If one bears in mind that there is always a discrepancy between what test persons say about their behaviour in a scientific context and what they actually do, one will soon realise that the current problem of plagiarism and the Google-induced copy & paste culture at universities is without any doubt a big one. We therefore think that the problem - interpreted in the context of a general shift of cultural techniques - should move to the centre of the agenda in the academic world.
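How a randomized response survey yields an estimate of the true share of "yes" answers can be shown with a short sketch. The report does not state which RRT variant the Leipzig study used, so the forced-response design below is an assumption chosen for illustration: with probability `p_truth` a respondent answers truthfully, otherwise he or she answers "yes" regardless of the truth. The function name and the sample numbers are hypothetical.

```python
def rrt_estimate(yes_answers, n, p_truth):
    """Estimate the true share of 'yes' under a forced-response design.

    Assumed design: with probability p_truth the respondent answers
    truthfully; with probability (1 - p_truth) the answer is a forced
    'yes'. The observed yes-rate is then
        observed = p_truth * true_share + (1 - p_truth),
    which is solved here for true_share.
    """
    if n <= 0 or not 0.0 < p_truth <= 1.0:
        raise ValueError("need n > 0 and 0 < p_truth <= 1")
    observed = yes_answers / n
    true_share = (observed - (1.0 - p_truth)) / p_truth
    # Sampling noise can push the raw estimate slightly outside [0, 1].
    return min(1.0, max(0.0, true_share))


# Hypothetical numbers for illustration only (not data from any study):
# 1000 respondents, 475 'yes' answers, 75 percent answer truthfully.
estimate = rrt_estimate(yes_answers=475, n=1000, p_truth=0.75)
```

Because no individual answer reveals whether a particular respondent actually plagiarised, respondents have less reason to lie. The price is a larger sampling variance than with a direct question, so larger samples are needed for the same precision.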
On the other hand, the solution is not so much an a posteriori plagiarism check. First, students must be taught what constitutes a serious case of plagiarism (copying a few words without a footnote is not acceptable, yet it used to be considered a minor offence, in contrast to the very stringent rules now being introduced). Second, the educational system should make sure that plagiarism cannot occur, or cannot occur easily. We return to this in the lengthy section 14 by introducing two new concepts.

Meanwhile, the working definitions of what constitutes plagiarism are becoming more and more draconian. We are now heading towards a situation in which even so-called "sloppy referencing" is no longer tolerated. Just look at the following widespread US definition of plagiarism:

"All of the following are considered plagiarism:
- turning in someone else's work as your own
- copying words or ideas from someone else without giving credit
- failing to put a quotation in quotation marks
- giving incorrect information about the source of a quotation
- changing words but copying the sentence structure of a source without giving credit
- copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not"
[http://turnitin.com/research_site/e_what_is_plagiarism.html, visited 20/5/07]

This list in fact means that even a wrong footnote ("giving incorrect information about the source of a quotation") or a text consisting of quote after quote (the "majority of your work", which means that at least 50 percent must be one's genuine prose!) can constitute plagiarism. In Europe only some universities have adopted strong definitions of plagiarism so far. Most institutions distinguish between sloppy citation and "real" plagiarism - which of course gives responsible persons at universities the possibility to play down intentional plagiarism as sloppiness.
However, we should also mention that the very rigorous definition of plagiarism quoted above comes from Turnitin, a company whose business is selling plagiarism detection software. That such a company tries to define plagiarism down to a very fine level of granularity should not come as a surprise!

At the University of Salzburg in 2006, a leading professor refused to impose any consequences on a student who had plagiarised at least 47 pages of his diploma thesis verbatim from the web by simple copy & paste - some typing errors and misspellings from the original source even remained uncorrected. The responsible person called the student's copy & paste work "sloppy citation".

After some serious cases of plagiarism at the Alpen Adria University Klagenfurt, a working group published in 2007 a new and stronger definition of what constitutes plagiarism (translated here from the German original):

"Plagiarism is the unlawful appropriation of the intellectual property or findings of others and their use for one's own benefit. The most common forms of plagiarism in academic work are:
1) The verbatim adoption of one or more text passages without a corresponding citation of the source (text plagiarism).
2) The reproduction or paraphrasing of a line of thought in which the words and the sentence structure of the original are changed in such a way that the origin of the thought is obscured (idea plagiarism).
3) The translation of ideas and text passages from a foreign-language work, again without citing the source.
4) The adoption of metaphors, idioms or elegant linguistic creations without citing the source.
5) The use of quotations encountered in a work of secondary literature to support one's own argument, where the quotations themselves are documented but the secondary literature used is not (quotation plagiarism)."
[http://www.uni-klu.ac.at/main/inhalt/843.htm, visited 20/5/07]

For the first time, at least in Austria, idea plagiarism and even "quote plagiarism" were included (the latter means that you cite a quote you have read elsewhere without checking the original; for example, you cite Michel Foucault from the web or from the secondary literature and add a footnote to his original book, which you have never read).

In the ideal case, a scientific work whose value is based mainly on its textual component contains three text categories:

1) Genuine prose written by yourself or by the group of authors listed on the paper. This should be the largest part of the text, because science always has to do with innovation and with your own or the authors' critical reflections on the scientific status quo as reported in the literature and/or on current developments.

2) Direct quotes. To prevent scientific texts from being mere personal comments on a given topic, and to place one's own reflections in the context of as many other works and ideas from the same specific field as possible, direct quotes should be used to reproduce, word for word, inventive positions and theses articulated before the new work was written (without changing anything - exact 1:1 reproduction is essential here for the scientific reference system).

3) Indirect references. If you refer to an idea, a genuine concept or an empirical datum you have drawn from elsewhere, you have to refer to the original work by [see Weber] or - as used in the German humanities - by [Vgl.] (= compare to). This is usually called indirect referring.

There is empirical evidence that the triad of genuine prose, direct quotes and indirect referring is collapsing at the moment. The "Generation Google" or "Generation Wikipedia" tends to produce texts in a radically different way: They do not start writing in an inductive manner; instead they proceed deductively from the texts they have already marked and cut from the web.
They re-arrange these texts, write some connecting sentences between the already existing text chunks (what formerly was the genuine prose!) and bring the whole text into a proper form. Appropriation, paraphrasing (which today means simple synonym replacement!) and Shake & Paste are the new cultural techniques. But is this still science? Are we moving towards a text culture without brains? What is happening at the moment? Cognitive capacities are freed because searching, reading and writing can be delegated to Google, Copy, and Paste - but for what purpose? There is not even a need for rhetoric any more; the bullet points of PowerPoint are sufficient in most contexts (which, in fact, is another big problem of current computer-based learning and teaching). The following table shows these changes.

Table 4: From human brain involvement to a text culture without brains?
(Temporal development: cultural break in the last ten years. Rows: principles of research, logic of scientific discovery and knowledge production.)
SEARCHING: (Own) inquest | Search engine | Googling | GOOGLE JOCKEYING
READING: (Close and reflective) reading | Hypertextual net browsing | Keyword "power reading", scanning texts | SNIPPET CULTURE
WRITING: (Genuine and creative) writing | Appropriation and paraphrasing | Copy and paste, bringing all into a proper form | COPY & PASTE TEXTS
RHETORICS: Oral presentation (of the core thesis) | Computer-mediated presentation | Bullet point fragmentation | POWERPOINT KARAOKE
[Weber 2007 for this report]

The table features some problematic aspects of an evolving text culture without brains [also see Weber 2007a, Kulathuramaiyer & Maurer 2007]. The optimistic reading of this development is full of (naive?) hope that the cognitive capacities freed by Google, Copy & Paste, and PowerPoint will be occupied with other (more?) useful things.
A pessimistic reading of the tendencies listed above is the diagnosis of a crisis of the humanities as such [also see Weber 2006b]: Cases of plagiarism are proof of an increasing redundancy in the knowledge production of the humanities. For example: If hundreds of diploma theses on the history of the Internet already exist, students need not rewrite this history yet again; it is sufficient to produce a collage of text chunks found on the Internet itself (where you will find more than enough "raw material" on the history of the Internet - just try it out). If we take a closer look, we see that this discussion runs in the wrong direction. We should debate the redundancy of specific topics and not of cultural science as such - it is the lack of creativity of lecturers and professors that is responsible for the current situation and not "the" science.

We do not have a real answer to all that happens around us as far as the flood of information is concerned. There are reports that information is doubling each year. Knowledge, on the other hand, doubles only every 5 to 12 years (depending on the area and on whom you believe). This difference clearly shows that more and more is being written about the same knowledge, which necessarily creates a kind of plagiarism of at least some ideas!

In sum, we observe the spreading of a "culture of mediocrity" [Kulathuramaiyer & Maurer 2007] and a "culture of hypocrisy" in the current academic world. "Mediocrity" also refers to the trends of...
* rhetorical bullshitting replacing controversial discussions based on critical reflection;
* effects dominating over real content;
* affirmative research PR and techno PR instead of facing the real problems of scientific knowledge production (plagiarism, fraud, and manipulation or fabrication of data);
* and also to the trend of doing more and more trivial research on a micro scale whose only function is to confirm given common-sense assumptions ("Mickey Mouse Research", see Weber 2007a, 148 ff.).

Let us return to the next scene of the revolution, and therefore again to Google. Google does not only mark the starting point of the Copy Paste Syndrome; it will change, or has already begun to change, our interactions with texts in a second arena: http://books.google.com. There is no doubt that the digitisation of 15 million or more printed books (together with related initiatives like Amazon's "Search Inside") is the next big revolution. Information will be accessible to everybody on the web - but at what price, in which dose and in which context? Critical voices should not be ignored. Michael Gorman, president of the American Library Association, was cited in "Nature" with the following words of warning:

"But Gorman is worried that over-reliance on digital texts could change the way people read - and not for the better. He calls it the 'atomization of knowledge'. Google searches retrieve snippets and Gorman worries that people who confine their reading to these short paragraphs could miss out on the deeper understanding that can be conveyed by longer, narrative prose. Dillon agrees that people use e-books in the same way that they use web pages: dipping in and out of the content." [Bubnoff 2005, 552]

There are some empirical hints that the reading ability of the younger generation, and its ability to understand complex texts, is diminishing.
The success of bestsellers such as the Harry Potter series hides the fact that many of their readers are adults and not children! With the possibility to surf through (often only parts of!) books and to do keyword searches online, the cultural technique of reading a whole text as an entity defined by its author could become obsolete. It is a bit of a paradox: In the scanning streets of Asia, Google and Amazon are digitising millions of books. But the way Google Book Search will operate could put an end to all books as textual entities. Thus, the ideas of a core thesis, of a central argument with its branches, of the unfolding of a sophisticated theory or of the complex description of empirical research settings and results could soon turn into anachronisms. Instantaneous application knowledge found within a second by keyword search is the new thing. In the galaxy of the ubiquitous media flow, will we still have time for reading a whole book?

Google Book Search could thus once more change (or put an end to) the way scientific texts are produced in the academic world: A master thesis of at least 100 pages, a dissertation of 200 or 300 pages and a post-doctoral dissertation of even more pages - this could soon be history. Will the ways we define plagiarism then change as well? In the end: Are people engaged in fighting plagiarism anachronisms themselves?
In the era of copyleft licenses, Creative Commons and the scanning of millions of printed books - in many cases without explicit permission of the publishing companies - everything changes so quickly that critical reflection is essential (for optimistic descriptions of the current media situation and for "friendly" future scenarios see for example Johnson 2005, Johnson 1999, Gleich 2002, and Pfeifer 2007; for a critique especially of Google see Jeanneney 2006, Maurer 2007a, b, and c, and Weber 2007a; for a general apocalyptic version of the web development and especially the Web 2.0 see Keen 2007; a relatively neutral discussion of major developments can be found in Witten & Gori & Numerico 2007, here especially Chapter 2 on the digitisation of printed books).

For references see Section 16.

Section 3: The new paradigm of plagiarism - and the changing concept of intellectual property

(Note: Sections 1-5 are basically material produced by S. Weber after discussions with H. Maurer, with H. Maurer doing the final editing)

Plagiarism today transgresses the narrow boundaries of science. It has become a major problem for all social systems: for journalism and economics, for the educational system as well as - as we will show - for religion (!). In recent years of theory building, sociology has tended to describe society by one single macro trend. But this has in fact happened about 40 times or even more, so that contemporary sociological discourse knows many "semantics of society": Just think of "risk society", "society of communication" or "society of simulation". For this report we suggest adding the term "copying society" to the by now long list of self-descriptions of society.
A copying society is a society in which the distinction between an original and a copy is always crucial and problematic at the same time. It is also a society in which many social processes on the micro, meso, and macro scale can be observed through the distinction "original/copy" (if you think of things or objects) or "genuinely doing/plagiarising" (if you think of actions or processes). Plagiarism is, as mentioned, by no means a problem of science alone, and by no means a problem of text culture alone. The following examples will give an impression of that.

Plagiarism and a general crisis of the notion of "intellectual property" are intertwined phenomena. Many discussions play down severe cases of plagiarism by pointing to the freedom of text circulation on the web. There is a big misunderstanding at the moment that, for example, copyleft licenses imply the rejection of any concept of "plagiarised work". People who fight against plagiarism are also often accused of maintaining a conservative text ideology, or at least of not being up to date with current net developments [in this style see for example IG Kultur 2007]. The concepts of authorship and intellectual property are regarded as chains which should be overcome by a free cybernetic circulation of text fragments on the Internet - without any claim of authorship, without any legal constraints.

In this confusing situation it is not easy to argue that people who mix the plagiarism debate with the copyleft discussion make a category mistake: Everybody should opt for plagiarism-free work - regardless of whether it is published under a copyright or a copyleft license. Within the framework of a copyleft license, too, the texts, ideas, or images published should not involve intellectual theft. The plagiarism debate is about the genesis of an intellectual product; the copyright/copyleft debate is about its further distribution.
Nevertheless there are some major frictions which do not make the situation easier, especially for the younger Google Copy Paste generation that is used to the net. Just have a closer look at a paragraph of the "GNU Free Documentation License", an older license that is, for example, still prominently used by Wikipedia. Under this license the copying of texts, images, and other information is allowed on the condition that the copier publishes the copied version under the same license (this alone would be impossible in the classical academic publishing system!) and mentions the source. Furthermore, it is even allowed to modify the original version and publish it again - on the same condition of adopting the GNU license:

"You may copy and distribute a Modified Version of the Document [...], provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it."
[http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License, in original http://www.gnu.org/copyleft/fdl.html, both visited 20/5/07]

Recent publications on free licenses on the web are very euphoric about these developments and almost totally ignore the problem of plagiarism [see for example Dobusch & Forsterleitner 2007; for a criticism of that book see Weber 2007c]. The current ideology is a bit like this: If you want to be hip and give your work a progressive touch, publish your material under a Creative Commons license of your taste. Again we have to state that there is little intellectual reflection on this. But if you push a Creative Commons license to its extremes, it could mean the following: Transcribe an interview and publish it under a Creative Commons license. If a modified version is allowed, one can take that interview and change some central statements.
If the interviewed person complains that he or she never said what is published now, tell him or her about your specific license and that you have mentioned the original publication in a correct manner.

So at the moment there really is a kind of "reference war" between researchers concerned about (often Google-induced, as already shown) Cut and Paste plagiarism on the web and people who think euphorically about free licenses. In this report we argue that free licenses on the web (GNU, Creative Commons, and others) should be viewed sceptically - especially in the context of scientific or artistic production, that is, in all social systems concerned with creative processes and intellectual products.

To show how widespread plagiarism already is in all social systems, let us give the following examples. We start with the "chronological" aspect of a questionable socialisation into plagiarism and continue with case studies of plagiarism in various social systems.

* The new cultural technique of verbatim appropriation of found text segments can start in childhood if the child already has access to a cellular phone or to a computer: Youngsters, for example, tend to spread identically worded messages by SMS (this is already quite well documented by empirical media psychology); they also tend to appropriate contingent formulas for the headlines of their nickpages, for "personal" slogans and for the expression of basic attitudes towards life. Sentences like "i NeE Yo O i E! Wi hO yO hE wO(R) i$ bO(R)iNg!" or longer phrases as for example...

" ._YoU ArE oNe SpEcIaL PeRsOn Of 1o0o0o_. ._BuT yOu ArE ThE OnE BaBe - WhO i LoVe So DAmN MuCh_. ._ThAnK YoU BiG HeArT FoR ThE MoMeNts YoU LiStn tO mE_. ._Do YoU KnOw.. ThAt YoU ArE SoMeThInG SpEcIaL?_. ._LoVe YoU_. "
[Text examples taken from Weber 2007a, 122 ff.]

... are proof of a quite dramatic change in text culture: Text fragments circulate without the need to involve one's brain very much.
Text chunks spread in a "memetic" way; the idea of "authorship" or of a "real" and "authentic" message no longer needs to come up. The new viral text galaxy of redundant "Weblish" or Leetspeak formulas marks the first step towards a text culture without brains, towards a new cognitive distance between what your brain is occupied with and the text transmitted. For a generation which downloads songs, video files, ring tones, and cellular phone logos from the Internet, the downloading and appropriation of text also seems to be something completely normal. It is a big deficit of nearly all current empirical studies dealing with young people and their relationship to new media that the target group was only asked about the downloading of images, videos, and music, but not about the downloading of texts.

* The copy & paste culture introduced by Leetspeak formulas is continued with the (mis)use of Wikipedia by many pupils: At the moment there are no empirical investigations of the abuse of Wikipedia texts, but many individual reports and case studies hint at a dramatic change in knowledge production: For a school presentation on the topic of "Trafalgar Square", it often seems to be sufficient to copy & paste the whole text from the Wikipedia entry. Thus, pupils learn quickly how knowledge is produced at the beginning of the third millennium: Google the key term, then click on the Wikipedia link (which is very often ranked in top position or at least amongst the top three, as shown in Section 1), then copy & paste this text into your document (if you are "creative", use Google image search and produce a cool PowerPoint presentation with some images or even short movies not already found on Wikipedia). An Austrian teacher reported in this context that he warned a pupil not to copy and paste from Wikipedia.
The pupil answered that he did not understand this warning: If it is already written in Wikipedia, it can - and virtually must - be used for a presentation.

* When pupils do written assignments, they no longer need to write texts on their own. They can go to paper mills aimed especially at pupils, with countless ready-made school assignments, as for example http://www.schoolunity.de. This web site welcomes the willing plagiarist with an ethically highly problematic statement:

"No feeling like working on yourself? Just search amongst thousands of ready-made written assignments, presentations, specialised texts or biographies and print whatever you need."
[From http://www.schoolunity.de - translation by the authors of this report]

Of course there is no standardised plagiarism check on this site. The only indicator of the quality of a paper is the mark the "author" got at his or her school. We suppose that many papers on such paper mills are plagiarised - which means that whoever uses them without quotation plagiarises plagiarism (and whoever cites them correctly still cites plagiarism). The problem of second order plagiarism comes into view: Compare the following two screenshots and you will clearly see that the assignment on http://www.schoolunity.de is already plagiarised from an older written assignment, done for another school. But even in this "original" work, the original paragraph from an older book is referenced so sloppily that it is impossible to sort out the references without consulting the original book source [for a documentation and discussion of this example also see Weber 2007a, 59 ff.]. Of course such paper mills always suggest that there is no need to go back to the original source.

Figure 18: Example of a written assignment for school (with sloppy citation) [http://gymnasium-damme.de/fachbereich-b/Geschichte/kursakt/geschi/imperialimus/imperialkurz.doc, p.
21, visited 20/5/07]

Figure 19: Plagiarised version of this assignment in a paper mill for school kids [http://www.schoolunity.de/schule/hausaufgaben/preview.php?datensatzkey=001666&query=action%, visited 20/5/07]

* When pupils who have been socialised with the Google Copy Paste technique later come to university, they are confronted with strict scientific citation guidelines whose deeper sense they no longer fully understand. In their eyes quotation seems to be something annoying, sometimes even something ridiculous (one author of this report once heard a student say "Why do I have to cite it, it's clear that I have it from the web!"). Amongst the current generation there is little or no feeling for the need to reflect on a source or to re-check what seems to be evident because it is written on the net. In one case, one author of this report proved that a term paper by a German student published on http://www.hausarbeiten.de (on the Austrian philosopher and psychologist Ernst von Glasersfeld) was copied in full from a web source several years older. Yet the "author" earned money with the plagiarism, and the text could in turn be plagiarised by other students willing to pay 4.99 Euro, who would thus commit second order plagiarism [the case is documented in Weber 2007e].

* Unfortunately plagiarism is not limited to pupils and students. Stefan Weber has collected many cases of plagiarism by teachers and professors. One case dealt with a stolen PowerPoint presentation of a whole term lecture [Weber 2007a, 64], another with dozens of uncited and paraphrased pages in a post-doctoral dissertation. Another professor copied more than 100 pages by a former colleague into his post-doctoral dissertation. Plagiarism by teachers and professors is always a big problem because the question arises: Which citation ethics and which reference culture do they teach in their courses, and which degree of misuse do they tolerate?

* Plagiarism in the Web 2.0: There are many reported cases of plagiarism in Wikipedia [N. N.
2006b], in other Wikipedia clones on the net as well as in weblogs [some cases are discussed in Weber 2007b]. One problem with Wikipedia is the fact that the origin of a text is often uncertain [for a critique see also Weber 2005b and 2007a, 27ff.]. Nobody can check whether an original text - later subject to various changes and adaptations by the net community - is plagiarised or not. For example, the text could come from a book source which is not cited properly [also Weber 2005b and 2007a showing concrete examples]. In a way, Web 2.0 applications like wikis and weblogs produce a second order knowledge galaxy that often originates in print sources but very often has an unclear reference system. This gap often makes data, information, and knowledge drawn from the web problematic. Here again, the logic of RSS feeds from the net (e.g. the automatic republishing of news snippets) and the logic of exclusive publishing from the print era collide. Some bloggers also confuse the (legal) possibility of an RSS feed with the illegal appropriation of a text from another site: Syndication and plagiarism are usually not the same. Nevertheless, a constructive dialogue between these two paradigms is absolutely necessary and should be put at the top of the agenda in media science as well as in copyright law.

Table 5: Copyright and copyleft paradigm in comparison
PRINT LOGIC | NET (OR WEB 2.0) LOGIC
Copyright paradigm | Copyleft paradigm
One author or a group of authors | Author(s) not necessarily mentioned, or nickname(s), avatars, ...
Publishing companies as publishers | Free licenses
Control over distribution by publishers | RSS feeds, free flow of information
[Weber 2007 for this report]

* As already mentioned, plagiarism transcends scientific knowledge production and the educational system. It also affects nearly every other social system. Plagiarism also transcends computer-based or net-based information; it was also an (often neglected) concern in the print galaxy.
Several cases of book plagiarism have also been detected by Stefan Weber (as described with one case study in Section 4).

* In the following we would like to show one interesting example of plagiarism in religion: In 2007 an Austrian bishop plagiarised a sermon for the Lenten season from a Swiss bishop, originally dating from 2004. (Please note that the following two screenshots stem from the web sites of two different bishops - there are of course no quotes or references:)

Figure 20: Original sermon of a Swiss bishop on the web [http://www.bistum-basel.ch/seite.php?na=2,3,0,35089,d, visited 20/5/07]

Figure 21: Plagiarised version of this sermon by an Austrian bishop [http://www.dioezese-linz.at/redaktion/index.php?action_new=Lesen&Article_ID=34667, visited 20/5/07]

When accused of plagiarism by an Austrian radio station, the bishop said the text was a joint work of both bishops and that he had forgotten the footnotes because he had to leave for a funeral in Rome.

* Plagiarism in journalism is also an often neglected problem [see Fedler 2006]. On the web site http://www.regrettheerror.com Craig Silverman wrote that plagiarism was the biggest problem for journalism in 2006. Some cases are reported in which even journalists simply copied and pasted from Wikipedia [Weber 2007a, 35]. In another documented case a journalist just took over a PR release from a commercial TV station and sold it to a renowned online magazine as his own story. While in the nineties some severe cases of fabrication shocked media ethics, cases of cut and paste plagiarism are a relatively new form of misconduct in journalism that is often not seen by the public. Especially in this field we recommend much more empirical research.
* "Famous" plagiarism cases in the context of politics were sometimes reported worldwide in the media - usually without any consequences for the plagiarists: President Putin was accused of having stolen text in his dissertation from an older American book; the British government was accused of having simply copied text for a decisive Iraq dossier from a dissertation ten years older, and so on.

* In fiction, too, more than a handful of big plagiarism accusations and proven cases of plagiarism have drawn much public attention in recent years.

* In the arts, plagiarism again transcends the domain of texts and also affects stolen melodies from pop hits or plagiarised narratives of cinema movies.

* Plagiarism accusations and cases are also reported from all fields of the so-called "creative industries". The following example shows a case of a supposed logo theft.

Figures 22 and 23: Accusation of logo plagiarism (the original and the supposed plagiarism) [Both images from http://www.qxm.de/gestaltung/20060626-163801/plagiat-oder-zufall?com=1, visited 20/5/07]

Finally, we have to mention that plagiarism of ideas is always hard to identify. In the following example the lower PR campaign (dating from 2006) was accused of being a plagiarism of the upper campaign from 2002 (for a non-expert observer this might be quite hard to follow).

Figure 24: Accusation of idea plagiarism in a PR campaign [Comparison from "medianet" print edition, 16 January 2007, p. 13]

Not only texts, but also mathematical formulas and illustrating figures can be plagiarised in scientific and other works. In the following (also a good example of plagiarism from an older print source!) you see a page from a plagiarised Austrian post-doctoral dissertation compared to the original.

The original page from a concise dictionary (anthology) dating from 1979:

Figure 25: Original scientific text from 1979 [Scan from Werner Kern (ed.). Handwörterbuch der Produktionswirtschaft.
Stuttgart: Poeschel, 1979, column 825]

The plagiarised page in the post-doctoral dissertation from 1988:

Figure 26: Plagiarised scientific text from 1988 [Scan from N. N., Anlagenwirtschaft unter besonderer Akzentuierung des Managements der Instandhaltung. Post-doctoral dissertation, 1988, p. 91]

Thinking of all the faces of plagiarism mentioned above, we come to the following overview:

Table 6: The various faces of plagiarism
Cases of intentional plagiarism occur in...
TECHNICAL CHANNEL: print/books plagiarism; net plagiarism, cut and paste plagiarism
SOCIAL SYSTEM: journalism; religion; economics; science; education; politics; arts
LEVEL/TYPE OF CONTENT: text-based; images, logos, etc.; otherwise "creative ideas"; structures, concepts; data; formulas, etc.; songs, film narratives, etc.
(A marker in the original table indicates the current focus of public interest.)
[Weber 2007 for this report]

One big problem is the fact that the majority of the social systems concerned have no institutionalised routines for dealing with accusations and actual proven cases of plagiarism (we do not speak of the legal dimension here, but of the ethical aspect and of the eminently important presupposition that we always have to trust and rely upon the knowledge-producing social systems!). There is still a relatively widespread mentality of treating cases of intentional intellectual theft as harmless peccadillos. Often plagiarists as well as the persons responsible tend to play down the problem: They speak of a "computer mistake", a "problem with the floppy disk" (when the plagiarism occurred in the eighties or nineties) or of some other kind of sloppiness. In many cases people are more worried about the image of an institution and the prevention of bad media coverage than about what actually happens in concrete knowledge production: the image comes before the content.
In the next Section we would like to discuss some strategies for overcoming plagiarism and why they are all insufficient so far: The plagiarist always seems to be far ahead of the plagiarism searcher or "hunter" - be it a person or a piece of software.

For references see Section 16.

Section 4: Human and computer-based ways to detect plagiarism - and the urgent need for a new tool

(Note: Sections 1-5 are basically material produced by S. Weber after discussions with H. Maurer, with H. Maurer doing the final editing)

In the current cheating culture the plagiarists are often superior to the originators or to the evaluating teaching staff. Of course this is also a question of generations: The younger people - pupils and students - are more familiar with the latest technological developments than the older generation. In fact, we suppose that a "skills gap" between pupils and students on the one side and the faculty on the other side opened up around the year 2000. This was the time when students realised that they could cut and paste texts from the web very easily - and that the chance of being detected was very small. Of course, this can be interpreted as, and is, a sign of progress in the younger generation, yet in this case the progress is exploited in an undesirable way. Meanwhile many professors and lecturers know that they have to google whenever they read suspicious text fragments, e.g. fragments written in highly elaborate prose with tacit knowledge the student simply could not have. Usually such texts are very easy to detect, and it is a shame for some universities that some clearly plagiarised texts remained undetected and were even given good or very good marks [see the documented cases in Weber 2007a]. But we have to note that even today Google - as a possible first step in detecting plagiarism - is often helpless when plagiarists use advanced techniques of text theft.
Some strategies that plagiarists use are listed in the following:

* A student willing to plagiarise efficiently can order some current master or doctoral theses on the topic he or she should write on by interlibrary loan. For example, if you study in Berlin, order some master theses from Munich or Zurich. The web can help the plagiarist: rule out that the professors who assessed the other theses know your own professor well or are in close contact with him or her (you can do that by just googling their names with a plus). Rule out that the borrowed master theses are already available as full-text versions on the web (again use Google for that check!). Then make a "cross-over", a post-modern mash-up of the ordered theses. If you have excluded all these risk factors, the probability that your hybrid work will ever be detected is rather small.
* A student willing to plagiarise efficiently can order some current master or doctoral theses by interlibrary loan from foreign countries and translate them or have them translated. Again, proceed as described above. And again, you can succeed with this method!
* Another method is to take a given text and replace about every third or fourth noun or verb with a synonym. The disadvantage is that you have to use your brain a little for this method; the big advantage is that current anti-plagiarism software won't detect your cheating (as we will show below).
* Probably still the best method is to plagiarise a print source which isn't already digitized. The problem is that one day it could be digitized.

In the following, we will discuss such an example. In 1988 an Austrian professor wrote the following "genuine" text in his post-doctoral dissertation (we suppose the text was presented as genuinely written because there are no quotes, references or footnotes around it):

Figure 27: A suspicious text segment [Scan from N. N., Anlagenwirtschaft unter besonderer Akzentuierung des Managements der Instandhaltung. Post-doctoral dissertation, 1988, p. 135]

If you want to do a manual Google check, it is sufficient to select just a few words (usually you needn't search for especially elaborate parts of the prose). In this case it is enough to type "Durch den Einsatz eines Stabes" (the first five words of the paragraph) into Google. Surprisingly, Google only finds one match. With this simple but effective method, it has become possible to detect plagiarism originating from the print era. In this case, the author plagiarised in 1988 a book first released in 1986. Meanwhile the book was digitized and made available online via springerlink.com. Of course you have to pay to obtain the full text of the book chapter. Nevertheless, the fragment found by Google already gives enough evidence of plagiarism:

Figure 28: The original document found by Google [Screenshot, 18 May 2007]

So we come to a central question of our report: if Google's indexing of web sites is as effective as just shown in this example (please note that a document in the paid content area was found!), why hasn't Google already developed its own anti-plagiarism or text comparison/text similarity tool? Steven Johnson already imagined such an "intelligent" text tool in 1999 [Johnson 1999, 157 ff.]: this tool should be able to find texts by tagging keywords "automatically" on a semantic level and also be able to compare them with each other.

As we know, "Google Labs" is the place where Google permanently tries out new things. "Google Books" enables the user to browse through millions of books and scan textual snippets (and not to read whole books!). "Google Scholar" enables the user to look up how a term or person is mentioned or cited in scientific texts. There are various other applications in an experimental state dealing with text - but no text comparison or anti-plagiarism tool as far as one can see. Is Google not interested in this topic? Does Google not see its relevance in the current copying society?
Or does Google intentionally look away to protect the interests of plagiarists or even of the whole cut and paste culture (in a way, Google itself does cut and paste, for example automatically with "Google News", when fragments of the web sites of other news media are fed into its site)? Maybe there are ideological reservations. The problem is that Google is not very transparent about such questions. One author of this study twice e-mailed the German Google spokesman about text comparison and anti-plagiarism tools, but received no response at all. It is also possible that Google is following the development of the market for anti-plagiarism tools closely and will move in for a "kill" if it turns out to be economically interesting, see Appendix 2.

As long as this situation does not change, we can only use Google and our brains to detect plagiarism and hope that more and more texts will be digitized.

Before we look at the various anti-plagiarism tools already on the market, let us distinguish between plagiarism prevention in advance and plagiarism detection afterwards. In an ideal situation, intentional plagiarism couldn't occur at all: if all students were really interested in what they search for, read and write, and if they grasped the deeper sense of scientific knowledge production, no fake culture could arise. Of course, we don't live in this ideal context; the world has changed dramatically in the last ten years - due to technological, economic, and political transformations. In the educational system, title marketing has to some extent replaced the classical concepts of education and enlightenment: many students no longer want to accumulate and critically reflect on knowledge; instead, they are in search of the quickest way to an academic degree. Therefore the fight against plagiarism on the level of prevention often seems to be a lost game. Two examples of more or less successful "sensitisation strategies" follow.
Nearly every academic institute meanwhile warns students not to plagiarise. But at the moment it is not clear whether flyers such as the following (an example from an Austrian university) are really able to eliminate a kind of "betrayal energy" (a willingness to cheat) amongst the students:

Figure 29: Plagiarism warning flyer from the University of Klagenfurt, Austria [http://www.uni-klu.ac.at/mk0/studium/pdf/plagiat01.pdf, visited 22/5/07]

Another example is taken from a university in Germany. In a PDF available online, some problematic aspects of web references are discussed and the Google Copy Paste Syndrome (GCPS) is also explicitly mentioned:

Figure 30: The Google Copy Paste Syndrome mentioned in a brochure from a university in Dortmund, Germany [http://www.fhb.fh-dortmund.de/download_dateien/Literaturrecherche_FHB_Dortmund.pdf, p. 15, visited 22/5/07]

An overview of how academic institutions worldwide react after proven cases of plagiarism is given in [Maurer & Kappe & Zaka 2006, 1052 ff.; Note: This paper, which was written as part of this report, is added as Section 17: Appendix 1]. Please note that there is a broad spectrum of possible consequences depending on the gravity of the plagiarism - from an obligatory seminar in science ethics to a temporary suspension from university. Generally one can observe that universities in the US and in Great Britain still fight plagiarism more effectively than most universities in (the rest of) Europe. In particular, some reported cases from Austria and Germany show that plagiarism is still played down or even tolerated when it occurs within the faculty. But for plagiarising professors, too, one could imagine some adequate consequences. Commenting critically on a current German case of a plagiarising professor of law, the main German "plagiarism hunter" Debora Weber-Wulff wrote in her blog:

"Here we have a clear determination of plagiarism on the part of a professor. Surely something must happen? I can imagine all sorts of things: research funding moratorium; taking away his research assistant the next time it is free; making him take a course in ethics; making the whole department - which appears to have a culture in which such behavior is acceptable - take an ethics course; assigning hours of public service such as doing something about the cataloging backlog in the library or whatever; requesting that he donate the proceeds from the book to financing a course on avoiding plagiarism for students. Surely there must be something which is the equivalent of failing a student for a course in which they submit a plagiarism." [http://copy-shake-paste.blogspot.com, visited 22/5/07]

In the following, we will concentrate on digital measures to fight plagiarism, that is, on strategies that are applied after plagiarism prevention has failed. But we should not forget that effective digital tools also have an important preventive didactic function!

At the moment, plagiarism software systems have problems with at least four varieties of plagiarism:

* plagiarism of print-only sources (with the exception that some print sources could already have been checked earlier by software which stores all checked documents in its own databases, provided that students accepted that storage - then that kind of print plagiarism can be detected). Some systems have also reported problems with online documents only available on pay sites or hidden in the deep web.
* plagiarism of sources which are only available in image formats. If you scan a page that is not already digitized and accessible online and then convert it into a text format with OCR (optical character recognition) software, no current plagiarism software system will detect it.
* plagiarism of passages written in foreign languages. At the moment, no plagiarism software on the market features an integrated automatic translation function.
There are experimental attempts to develop software which is able to translate text originating from foreign languages into a basic English form with normalised words [see Maurer & Kappe & Zaka 2006, 1079 f.; Note: This paper, which was written as part of this report, is added as Appendix 1].
* plagiarism of original texts in which many words are replaced by synonyms. To demonstrate that, note the following synonym experiment [for a similar example also see Maurer & Kappe & Zaka 2006, 1067 ff.; Note: This paper, which was written as part of this report, is added as Appendix 1].

We take the following two segments copied verbatim from the web:

1) While it has many nicknames, information-age slang is commonly referred to as leetspeek, or leet for short. Leet (a vernacul