How to find news that has “disappeared” from the Internet?
Published in Journalistic Lessons on 7-06-2014. Author: Filip Stojanovski
This journalistic lesson shows how to use Google, the Time.mk Archive, and Archive.org to access content that is no longer available on the websites that originally published it.
By: Filip Stojanovski, ICT expert, program coordinator at Metamorphosis Foundation
In certain situations there is a public interest in finding content that was published online but later removed. This journalistic lesson shows how to use tools by Google, the Time.mk Archive, and Archive.org to see content that is no longer available on the websites that originally published it.
Data transfer over the Internet is done by copying from one computer to another. When users “visit” a webpage, they actually request a copy of it from the server where it resides, and a number of further copies are made by intermediaries along the way, through the servers of the internet provider, to their computer. Similarly, in order to “find out” what is online, search engines make a number of internal copies of webpages for further analysis. These processes increase the chances that online content is preserved even if it has been erased from the original website. The probability that something published online will remain “there” “forever” is very high. If not forever, then long enough.
Regardless of the motive for erasing or changing it, content may be removed from the web in one of the following ways:
- Direct erasing by the authors, editors, publishers or other persons with administrative privileges on the website. When accessing the web address (URL) of such an article, the user gets a message such as “such page does not exist”;
- Changes in the databases or replacement of the software that powers the website (the content management system). When ordering such changes, some media do not treat preserving the functionality of the archives as a priority. For instance, Kanal 5 has had a website since 1999, but the oldest available news items date from 2003;
- Change of the domain or other website attributes, for instance from utrinski.com.mk to utrinski.mk, making some articles unavailable at the time of writing, even though they appear when the site’s built-in search functionality is used;
- Shutting down of whole websites, as in the case of www.a1.com.mk.
There are two ways to find disappeared news, and they can be combined. The first is to search using keywords that refer to the news content, especially if you know the title or personal names that appear within the news body. The second, more efficient way applies if you know the exact web address of the disappeared news item. You may come across this address (also called the link or URL) through prior searching, for instance via Google. Useful places to search for Macedonian-language content include:
- Aggregators such as Time.mk, Grid.mk, Daily.mk and Ping.mk;
- Forums, blogs, and bookmarking/discussion sites such as the retired Kajmak;
- Old social media posts, esp. from Twitter or Facebook.
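The keyword method can be sketched in a few lines of Python. This is only an illustration: it builds one Google search URL per aggregator, combining an exact-phrase title with the `site:` operator. The function name `build_search_urls` and the exact shape of the search URL are assumptions made for this sketch, not something prescribed by the aggregators or by Google.

```python
from urllib.parse import urlencode

# Aggregator domains named in the text above.
AGGREGATORS = ["time.mk", "grid.mk", "daily.mk", "ping.mk"]


def build_search_urls(title, sites=AGGREGATORS):
    """Return a dict mapping each site to a Google search URL for the title."""
    urls = {}
    for site in sites:
        # Quotation marks request the exact phrase; site: limits the
        # results to a single domain.
        query = f'"{title}" site:{site}'
        urls[site] = "https://www.google.com/search?" + urlencode({"q": query})
    return urls
```

Calling `build_search_urls("Top 10 most read news")` returns four URLs that can be opened in a browser one by one. Any hit on an aggregator usually also reveals the original address of the article.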
Google Cache
Even though it sounds similar to the word “cash,” Google Cache refers to a place for temporary storage, in this case temporary storage of copies of indexed pages. Sometimes Google provides links to cached pages within its search results. According to Google’s explanation:
Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable. If you click on the “Cached” link, you will see the web page as it looked when we indexed it. The cached content is the content Google uses to judge whether this page is a relevant match for your query.
When the cached page is displayed, it will have a header at the top which serves as a reminder that this is not necessarily the most recent version of the page. Terms that match your query are highlighted on the cached version to make it easier for you to see why your page is relevant.
The “Cached” link will be missing for sites that have not been indexed, as well as for sites whose owners have requested we not cache their content.
If you have the direct link to the erased or modified webpage, Google provides a simple method to see the cached content that resided at that address. To do this, type the parameter “cache:” in the Google search field, followed immediately by the entire address (URL), in this form: cache:[full address of the page]
- Keep in mind that content stored in Google Cache has a relatively short lifespan: it is kept for only a few weeks. So if you want to find something, you are working against a tight deadline. Do not delay.
- If the webpage underwent several changes, Google Cache typically shows the last recorded version, not the original (first) one.
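As a minimal sketch of the “cache:” query described above, the following Python helpers assemble the text to type into the search field and a matching search URL. The helper names and the example address are hypothetical, and whether Google still serves cached copies for this operator is an assumption; the code only illustrates the query format.

```python
from urllib.parse import urlencode


def cache_query(url):
    """Return the text to type into the Google search field."""
    # The operator is followed directly by the address, with no space.
    return "cache:" + url.strip()


def cache_search_url(url):
    """Return a search URL carrying the cache: query (URL shape is assumed)."""
    return "https://www.google.com/search?" + urlencode({"q": cache_query(url)})
```

For a hypothetical address such as `http://example.com/news/123`, `cache_query` returns `cache:http://example.com/news/123`.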
Time.mk Archive
The Macedonian aggregator Time.mk enables finding Macedonian online content through its special “Archive” section (time.mk/arhiva). While the basic form of this aggregator gathers information from over 150 websites, it provides only a limited amount of data about each news item: the link, the title, the intro, but not the entire text. Its purpose is to redirect readers to the original websites, and in order to do that fast it retains only the minimum necessary data. The “Archive” section, on the other hand, contains data from 16 Macedonian sources, including important media and the Parliament, as whole, searchable texts. For articles that still exist on their original websites, this archive works as a classic search engine and redirects directly to them. For erased articles, however, it provides the text (but not the photos or multimedia elements).
- Unlike Google and the Time.mk aggregator, which constantly check what is new on the websites they cover, the Time.mk Archive updates its database only occasionally, about twice a year. As a result, the versions of articles preserved there are those that were accessible on their original websites at the moment of “loading.” Articles that were both published and removed before November 2012, or that were published and erased between two “loadings,” are not recorded in this archive.
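The effect of occasional “loadings” can be made concrete with a small Python sketch. The loading dates below are hypothetical, chosen only to match the “about twice a year” rhythm described above, and the function name `captured` is an assumption of this example.

```python
from datetime import date


def captured(published, erased, loadings):
    """Return the loading dates on which the article was still online.

    published: date the article went online
    erased:    date it was removed, or None if it is still online
    loadings:  dates on which the archive refreshed its database
    """
    return [
        day for day in sorted(loadings)
        if published <= day and (erased is None or day < erased)
    ]


# Hypothetical loading dates, roughly twice a year.
LOADINGS = [date(2012, 11, 1), date(2013, 5, 1), date(2013, 11, 1)]
```

An article published on 10 January 2013 and erased on 1 March 2013 falls between two loadings, so `captured` returns an empty list for it. Had it stayed online until June 2013, the loading of 1 May 2013 would have preserved it.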
The Internet Archive
The Internet Archive (Archive.org) is a nonprofit digital library with a mission to enable “universal access to all knowledge.” It provides permanent storage of and free public access to collections of digital content, including copies of websites, music, films and over three million books with expired copyright. Besides its archiving function, it is an activist organization advocating for a free and open Internet.
- Internet Archive front page.
Similarly to Google, the Wayback Machine, the software used by the Internet Archive, has stored copies of several hundred billion webpages since 1996. Among others, it archives pages from most of the influential Macedonian media. However, unlike Google, the Archive does not provide keyword search of this stored content: it is accessible exclusively by entering the direct web address (URL) of the desired page.
A distinguishing feature of this “Machine,” compared with the two previous tools, is that it makes it possible to check all stored versions of the same page, organized by date.
- Archive.org enables checking the contents of the A1.com.mk front page over the years.
- In addition, when opening a stored page, one can in some cases use the links within it to access other old pages from the same website. Occasionally the stored pages display photos or activate code that displays animations stored on other websites. In general, the Internet Archive keeps only textual content in HTML format and embedded images, and does not store the video files that were part of these articles. Therefore, for instance, one can read many of A1 TV’s news items, but not in their entirety. At the time of creation many of them were just short blurbs intended to prompt readers to watch the video, and they do not contain full transcripts of the originally made-for-TV news.
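Looking up a known address in the Wayback Machine can be sketched in Python as follows. The sketch relies on the publicly documented web.archive.org address patterns and on the Wayback availability endpoint at archive.org/wayback/available, as far as the editor understands them. The function names are this example’s own, and the `lookup` function needs network access, so treat it as an illustration rather than a tested client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def all_captures_url(page_url):
    """Address of the calendar view listing every stored version of a page."""
    return "https://web.archive.org/web/*/" + page_url


def snapshot_url(page_url, timestamp):
    """Address of the capture closest to a YYYYMMDD[hhmmss] timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_api_url(page_url, timestamp=None):
    """Address of the availability query for a page, optionally near a date."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(api_response):
    """Extract (timestamp, url) from a decoded availability response, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(page_url, timestamp=None):
    """Query the availability endpoint over the network."""
    with urlopen(availability_api_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

For the front page mentioned above, `all_captures_url("www.a1.com.mk")` gives the address of the calendar view, while `snapshot_url("www.a1.com.mk", "20100101")` asks for the stored copy closest to 1 January 2010.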
Example 1: “Top 10 most read news…”
None of the links in the article “Top 10 most read news on Kurir for 2012” [mk] work. The list contains articles with photos of alleged suspects in the murder at Smilkovo Lake, about the adventures of media personality Boki 13, sport controversies, the promotion of an expanded text of the nationalist chant “Boy, get out…” (mac. Izlezi momche), a hoax announcement that Facebook would close, presented as real news… The list notes the news about the “miracle” in the church of St. Dimitrija in Skopje as the most read.
Clicking on any of those links results in a page with the text:
“404 – The Article #[database record number] has not been found.”
However, all these articles have preserved copies on Archive.org (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), some of them in several versions.
Example 2: “Political Parties use Internet Hounds…”
On November 5, 2010, Sitel TV published a news item titled “Political Parties use Internet ‘Hounds’ to Impose Opinions in the Space for Comments,” covering an interesting topic in a quite balanced manner. Unfortunately, the news item has been removed and does not show up when using the search feature of the original site. Using the news title as keywords in the Google search engine reveals traces of the news on Ping.mk, Time.mk, Daily.mk, and on the Cotle.ca forum, which can be used to ascertain the original link (URL).
- A copy of an erased article preserved by archive.org.
Visiting the original link results in an English-language message:
“The requested page “/dnevnik/makedonija/partiite-so-internet-%E2%80%9Czagari%E2%80%9C-nametnuvaat-mislenje-vo-prostorot-za-komentari” could not be found.”
Google Cache does not provide a cached copy of this text because of the long time that has passed since it was crawled, and the Time.mk Archive has not recorded it either, because it was erased before the archive’s inception. However, there are preserved copies on Archive.org. The design of some of these copies differs from the original site, and they lack some elements, including the illustration and the formatting (CSS).
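The address in the error message above is percent-encoded, which hides the quotation marks around the word “zagari.” As a small sketch, Python’s standard library can decode such a path and split its last segment into words, which can then be reused as search keywords on aggregators. The function name `slug_keywords` is an assumption of this example, and the host name of the site is deliberately left out because the message quotes only the path.

```python
import re
from urllib.parse import unquote

# Path exactly as quoted in the error message above.
PATH = ("/dnevnik/makedonija/partiite-so-internet-%E2%80%9Czagari%E2%80%9C"
        "-nametnuvaat-mislenje-vo-prostorot-za-komentari")


def slug_keywords(path):
    """Decode a percent-encoded path and split its last segment into words."""
    decoded = unquote(path)  # %E2%80%9C decodes to the “ character
    slug = decoded.rstrip("/").split("/")[-1]
    # Keep runs of word characters; hyphens and quotation marks are dropped.
    return re.findall(r"\w+", slug)
```

For the path above, the result is the list of transliterated words from the title, from `partiite` to `komentari`.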
When faced with the need to find data that someone wanted to disappear from the Internet, a combination of the different search methods usually produces the best results. Considering the growing trend of conducting much of everyday life online, this area of investigative journalism will find increasing application over time.
This journalistic lesson was created within the framework of the USAID Media Strengthening in Macedonia Project – Media Fact-Checking Service Component, implemented by Metamorphosis. The educational article is made possible by the generous support of the American people through the United States Agency for International Development (USAID). The contents are the responsibility of its author and do not necessarily reflect the views of Metamorphosis, USAID or the United States Government. For more information on the work of USAID in Macedonia please visit its website (macedonia.usaid.gov) and Facebook page (www.facebook.com/USAIDMacedonia).