Friday, August 31, 2007
Modern Toss is a British series of cartoon booklets and books aimed at adults, and a television series based on them. It is the creation of Mick Bunnage and John Link; their company, Modern Toss Ltd, also trades under the name '*hitflap' (a thinly disguised rendering of 'shitflap').
The cartoons feature deliberately crude drawings, offensive language, violence, and satire of mundane life. Most of the printed cartoons are a single page long and consist of one, or occasionally a few, frames.
Booklets and books
Thursday, August 30, 2007
History
Mark VII Monorail - Disneyland - 2008-2010 (one monorail entering service each year)
Animation Academy - Hong Kong Disneyland - 2007
Mickey's Waterworks - Hong Kong Disneyland - 2007
Twilight Zone Tower of Terror - Walt Disney Studios Park - 2008
Hollywood Boulevard - Walt Disney Studios Park - 2008
Stitch Encounter - Walt Disney Studios Park- 2008
Monsters, Inc.: Ride n Go Seek! - Tokyo Disneyland - 2008
Toy Story Mania - Disney's California Adventure & Disney-MGM Studios - 2008
"it's a small world" - Hong Kong Disneyland - 2008 Current works
Finding Nemo Submarine Voyage - Disneyland - June 11, 2007
Pirate's Lair - Disneyland - May 25, 2007
Toon Studio - Walt Disney Studios Park - 2007
Crush's Coaster - Walt Disney Studios Park - 2007
Cars Race Rally - Walt Disney Studios Park - 2007
Monsters, Inc. Laugh Floor - Magic Kingdom - April 2, 2007
Finding Nemo - The Musical - Disney's Animal Kingdom - November 2006
The Seas with Nemo and Friends - Epcot - October 2006
Recent projects
Walt Disney Imagineering has been headquartered in Glendale, California, since the 1950s. Because the Disneyland Resort is only thirty-six miles away, there is no WDI field office there. The Walt Disney World Resort, by contrast, is large enough to require two field offices, which sit relatively close to each other. Field offices are located at:
Epcot and the Disney-MGM Studios, Walt Disney World Resort
Tokyo Disney Resort Administration Building, Tokyo Disney Resort
The former WDFA field office, Disneyland Resort Paris
Walt Disney Imagineering Hong Kong Site Office, Hong Kong Disneyland Resort
Locations
The Imagineers have been called on by many other divisions of the Walt Disney Company as well as being contracted by outside firms to design and build structures outside of the theme parks.
The very first Disney Store opened in Glendale, California, mere meters from WDI HQ, and was designed and constructed by a group of architectural Imagineers. The Store remains the only North American Disney Store (other than a Disney Store on the Disney studio lot itself) owned by the Walt Disney Company today.
Environmental and graphic design for The Disney Cruise Line and DCL's Castaway Cay
Imagineering has cooperated with Walt Disney Consumer Products on four more occasions for Disney Stores. First, WDI developed the Walt Disney Gallery at the Main Place Mall in Santa Ana, California (open for a short time in the 1990s, next to the still-operating Disney Store), and then a Roman-themed Disney Store at The Forum Shops at Caesars in Las Vegas, Nevada. Two more themed Disney Stores were opened in San Francisco, California, and New York City, New York - the latter having since been developed into a World of Disney.
After the purchase of the Disney Stores by The Children's Place in 2004, Disney developed a more exclusive chain of flagship stores called World of Disney (see above), located in Lake Buena Vista, Florida (at the Walt Disney World Resort), Anaheim, California (at the Disneyland Resort) and New York City. Each has been designed entirely by Walt Disney Imagineering, which is notable at a time when Disney increasingly contracts work out to other companies. A fourth incarnation of the "World of Disney" brand is due to arrive in Disney Village at Disneyland Resort Paris in 2008/2009.
Former Senior Vice President of Imagineering John Hench designed the "Tower of Nations" for the opening and closing ceremony of the 1960 Winter Olympics, where Walt Disney was Pageantry Committee Chairman.
Imagineering designed galleries and exhibitions for the Autry Museum of Western Heritage in Los Angeles, California.
Imagineering developed the Encounter Restaurant, a science fiction themed redesign of the restaurant suspended at the top of the 135-foot parabolic arches of the iconic Theme Building at the Los Angeles International Airport.
Imagineering manufactured flight attendant uniforms for Northwest Airlines from Claude Montana designs in 1989 due in part to the fact that Northwest's then-CEO Al Checchi was also a member of The Walt Disney Company's board. The WDI-made uniforms only lasted until 1992.
When Disney purchased ABC, the Imagineers remodeled the ABC Times Square Studios in New York City.
Imagineering designed exhibits for the Port Discovery children's museum at the Inner Harbor in Baltimore, Maryland.
Imagineers
Tuesday, August 28, 2007
General Sir (Frederick) Stanley Maude KCB (1916), CMG, DSO (24 June 1864 - 18 November 1917) was a British commander, most famous for his efforts in Mesopotamia during World War I and for conquering Baghdad in 1917.
Early life
Maude was born in Gibraltar into a military family; his father was Sir Frederick Francis Maude – a general who had been awarded the Victoria Cross in 1855 during the Crimean War.
Family
Maude attended Eton College and then Sandhurst military college. He graduated in 1883 and joined the Coldstream Guards in February 1884.
Education
Maude first saw active service in Egypt from March to September 1885, for which he was awarded the Egyptian Medal and the Khedive's Egyptian Star. He next served as a major during the Second Boer War, from January 1900 to March 1901, winning the DSO and the Queen's South Africa Medal. From 1902 to 1904, he served on the staff of the Governor-General of Canada. He then returned to Britain to become second-in-command of the Coldstream Guards, joined the General Staff, and was promoted to Lieutenant-Colonel in 1907 and Colonel in 1911.
Service
World War I
In World War I, Maude first served in France. He was a staff officer with III Corps when, in October 1914, he was promoted to Brigadier-General and given command of the 14th Brigade. He was wounded in April 1915 and returned home to recover. He returned to France in May and, in June, he was promoted to Major-General and transferred to command the 33rd Division, then still in training.
Monday, August 27, 2007
Otherware, sometimes called requestware, is a collective term for software that is not distributed as freeware, shareware or commercial software. Usually, otherware asks the user to do something other than pay the software author, so it may be considered a type of freeware. The requested action gives the user no particular benefit, however, and the author has no practical way to confirm that the user actually did what was asked.
Sunday, August 26, 2007
The pound was the currency of Rhode Island until 1793. Initially, the British pound and foreign coins circulated, supplemented by local paper money from 1710. Although these notes were denominated in pounds, shillings and pence, they were worth less than sterling, with 1 Rhode Island shilling = 9 pence sterling. The first issue of notes was known as the "Old Tenor" issue. This fell in value and "New Tenor" notes were introduced in 1740, worth 4 times the Old Tenor notes. Both Old and New Tenor notes were replaced in 1763 by "Lawful money" at a rate of 1 Lawful shilling = 6⅔ New Tenor shillings = 26⅔ Old Tenor shillings.
The State of Rhode Island issued Continental currency denominated in £sd and Spanish dollars, with 1 dollar = 6 shillings. The Continental currency was replaced by the U.S. dollar at a rate of 1000 Continental dollars = 1 U.S. dollar.
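The chain of rates above can be checked with a little arithmetic. The sketch below simply encodes the conversion ratios stated in the text; the function names and the example amounts are illustrative, not historical records.

```python
# Rates as stated in the text above.
OLD_TENOR_PER_NEW_TENOR = 4                    # 1 New Tenor shilling = 4 Old Tenor shillings
NEW_TENOR_PER_LAWFUL = 20 / 3                  # 1 Lawful shilling = 6 2/3 New Tenor shillings
OLD_TENOR_PER_LAWFUL = NEW_TENOR_PER_LAWFUL * OLD_TENOR_PER_NEW_TENOR  # = 26 2/3
SHILLINGS_PER_DOLLAR = 6                       # 1 dollar = 6 shillings
CONTINENTAL_PER_US_DOLLAR = 1000               # 1000 Continental dollars = 1 U.S. dollar

def old_tenor_to_lawful(old_tenor_shillings: float) -> float:
    """Convert Old Tenor shillings into Lawful money shillings."""
    return old_tenor_shillings / OLD_TENOR_PER_LAWFUL

def lawful_shillings_to_dollars(lawful_shillings: float) -> float:
    """Convert Lawful money shillings into dollars at 6 shillings to the dollar."""
    return lawful_shillings / SHILLINGS_PER_DOLLAR

def continental_to_us_dollars(continental_dollars: float) -> float:
    """Convert Continental dollars into U.S. dollars at the 1000:1 redemption rate."""
    return continental_dollars / CONTINENTAL_PER_US_DOLLAR

# 80 Old Tenor shillings -> 3 Lawful shillings -> 0.5 dollars (approximately, due to floating point).
print(old_tenor_to_lawful(80))                                # ~3.0
print(lawful_shillings_to_dollars(old_tenor_to_lawful(80)))   # ~0.5
print(continental_to_us_dollars(6000))                        # 6.0
```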
Saturday, August 25, 2007
Until 1795 Poland, or at least its nucleus, was ruled at various times either by książęta (dukes, ca. 960 – 1025, 1032–1076, 1079–1295, 1296–1300 and 1306–1320) or by kings (1025–1031, 1076–1079, 1295–1296, 1300–1306 and 1320–1795). The longest-reigning dynasties were the Piasts (ca. 960 – 1370) and Jagiellons (1386–1572). Intervening and subsequent monarchs were often also foreign rulers, or princes recruited from foreign dynasties. Polish independence ended with the Third Partition of the Polish-Lithuanian Commonwealth (1795) and was restored at the end of World War I (1918) on a republican basis.
Kingdom of Poland of the Piasts
Early Piasts
Piast Dynasty
Fragmentation
Piast Dynasty
Přemyslid Dynasty
Late Piasts
Piast Dynasty
Kingdom of Poland of the Jagiellons
Angevin Dynasty
Jagiellon Dynasty
Polish-Lithuanian Commonwealth
Valois Dynasty
Jagiellon Dynasty
Vasa Dynasty
House of Wiśniowiecki
House of Sobieski
Wettin Dynasty
House of Leszczyński
Wettin Dynasty
House of Leszczyński
Wettin Dynasty
House of Poniatowski
Partitions, 1795-1918
Kingdom of Galicia and Lodomeria
Habsburg Dynasty
Duchy of Warsaw
Wettin Dynasty
Congress Kingdom
Romanov Dynasty
Hohenzollern Dynasty
Royal coronations in Poland
Royal Coronations at Wawel Cathedral
Dukes of Greater Poland
Dukes of Masovia
Dukes of Pomerania
Dukes of Sieradz-Łęczyca
Dukes of Silesia
List of Galician rulers
Friday, August 24, 2007
English law, the legal system of England and Wales, is the basis of common law legal systems throughout the world (as opposed to civil law or pluralist systems in other countries, such as Scots law). It was exported to Commonwealth countries as the British Empire was established and maintained, and it forms the basis of the jurisprudence of most of those countries. English law prior to the American Revolution is still part of the law of the United States, except in Louisiana, and provides the basis for many American legal traditions and policies, though it has no superseding jurisdiction.
The essence of English common law is that it is made by judges sitting in courts, applying their common sense and knowledge of legal precedent (stare decisis) to the facts before them. A decision of the highest appeal court in England and Wales, the House of Lords, is binding on every other court in the hierarchy, which will follow its directions. For example, there is no statute making murder illegal; it is a common law crime, so although there is no written Act of Parliament making murder illegal, it is illegal by virtue of the constitutional authority of the courts and their previous decisions. Common law can be amended or repealed by Parliament; murder, by way of example, now carries a mandatory life sentence but previously carried the death penalty.
England and Wales are constituent countries of the United Kingdom, which is a member of the European Union, and EU law is effective in the UK. The European Union consists mainly of countries that use civil law, so the civil law system also operates in England in this form, and the European Court of Justice, a predominantly civil law court, can direct English and Welsh courts on the meaning of EU law.
The oldest law currently in force is the Distress Act 1267, part of the Statute of Marlborough, (52 Hen. 3). Three sections of Magna Carta, originally signed in 1215 and a landmark in the development of English law, are still extant, but they date to the reissuing of the law in 1297.
England and Wales as a distinct jurisdiction
See also: Contemporary Welsh Law
Although devolution has accorded some degree of political autonomy to Wales through the National Assembly for Wales, Wales had no sovereign law-making powers until after the 2007 Welsh general election, when the Government of Wales Act 2006 granted the Welsh Assembly Government the power to produce some primary legislation. The legal system administered through both civil and criminal courts remains unified throughout England and Wales. This differs from the situation in Northern Ireland, which did not cease to be a distinct jurisdiction when its legislature was suspended (see Northern Ireland (Temporary Provisions) Act 1972).
A major difference is also the use of the Welsh language, as laws concerning it apply in Wales and not in England. The Welsh Language Act 1993 is an Act of the Parliament of the United Kingdom, which put the Welsh language on an equal footing with the English language in Wales with regard to the public sector. Welsh can also be spoken in Welsh courts.
Wales
The Interpretation Act 1978, Schedule 1, defines the following terms: "British Islands", "England", and "United Kingdom". The use of the term "British Isles" is virtually obsolete in statutes and, when it does appear, it is taken to be synonymous with "British Islands". For interpretation purposes, England includes a number of specified elements:
"Great Britain" means England and Scotland including its adjacent territorial waters and the islands of Orkney and Shetland, the Hebrides, and Rockall (by virtue of the Island of Rockall Act 1972). The "United Kingdom" means Great Britain and Northern Ireland and their adjacent territorial waters. It does not include the Isle of Man; nor the Channel Islands, whose independent status was discussed in Rover International Ltd. v Canon Film Sales Ltd. (1987) 1 WLR 1597 and Chloride Industrial Batteries Ltd. v F. & W. Freight Ltd. (1989) 1 WLR 823. The "British Islands" means the "United Kingdom", the Isle of Man, and the Channel Islands.
Wales and Berwick Act 1746, section 3 (entire Act now repealed) formally incorporated Wales and Berwick-upon-Tweed into England. But section 4 Welsh Language Act 1967 provided that references to England in future Acts of Parliament should no longer include Wales (see now Interpretation Act 1978, Schedule 3, part 1). But Dicey & Morris say (at p28) "It seems desirable to adhere to Dicey's [the original] definition for reasons of convenience and especially of brevity. It would be cumbersome to have to add "or Wales" after "England" and "or Welsh" after "English" every time those words are used."
the "adjacent islands" of the Isle of Wight and Anglesey are a part of England and Wales by custom, while Harman v Bolt (1931) 47 TLR 219 expressly confirms that Lundy is a part of England.
the "adjacent territorial waters" by virtue of the Territorial Waters Jurisdiction Act 1878 and the Continental Shelf Act 1964 as amended by the Oil and Gas Enterprise Act 1982. Statutory framework
Since 1189, English law has been described as a common law rather than a civil law system (i.e. there has been no major codification of the law, and judicial precedents are binding as opposed to persuasive). In the early centuries, the justices and judges were responsible for adapting the Writ system to meet everyday needs, applying a mixture of precedent and common sense to build up a body of internally consistent law, e.g. the Law Merchant began in the Pie-Powder Courts (a corruption of the French "pieds-poudrés" or "dusty feet", meaning ad hoc marketplace courts). As Parliament developed in strength, and subject to the doctrine of separation of powers, legislation gradually overtook judicial law making so that, today, judges are only able to innovate in certain very narrowly defined areas. Time before 1189 was defined in 1276 as being time immemorial.
Common law
One of the major problems in the early centuries was to produce a system that was certain in its operation and predictable in its outcomes. Too many judges were either partial or incompetent, acquiring their positions only by virtue of their rank in society. Thus, a standardised procedure slowly emerged, based on a system termed stare decisis. Under this doctrine, the ratio decidendi of each case binds future cases on the same generic set of facts, both horizontally and vertically. The highest appellate court in the UK is the House of Lords (the judicial members of which are termed Law Lords or, more formally, Lords of Appeal in Ordinary), and its decisions are binding on every other court in the hierarchy, which are obliged to apply its rulings as the law of the land. The Court of Appeal binds the lower courts, and so on. Since the UK joined what is now the European Union, European Union law has had direct effect in the UK, and the decisions of the European Court of Justice bind the UK courts.
Precedent
The influences are two-way.
The United Kingdom exported its legal system to the Commonwealth countries during the British Empire, and many aspects of that system have persisted after the British withdrew or granted independence to former dominions. English law from before the American Revolution remains an influence on United States law and provides the basis for many American legal traditions and policies. Many jurisdictions that were formerly subject to English law, such as Australia and Hong Kong, continue to recognise the common law of England as their own - subject, of course, to statutory modification and judicial revision to match the law to local conditions - and decisions from the English law reports are still cited from time to time as persuasive authority in present-day judicial opinions. For a few states, the British Privy Council remains the ultimate court of appeal.
The UK is a dualist in its relationship with international laws, i.e. international obligations have to be formally incorporated into English law before the courts are obliged to apply supranational laws. For example, the European Convention on Human Rights and Fundamental Freedoms was signed in 1950 and the UK has allowed individuals to make complaints to the European Commission on Human Rights since 1966. Now s6(1) Human Rights Act 1998 (HRA) makes it unlawful "... for a public authority to act in a way which is incompatible with a convention right", where a "public authority" is any person or body which exercises a public function, expressly including the courts but expressly excluding Parliament. Although the European Convention has begun to be applied to the acts of non-state agents, the HRA does not make the Convention specifically applicable between private parties. Courts have taken the Convention into account in interpreting the common law. They also must take the Convention into account in interpreting Acts of Parliament, but must ultimately follow the terms of the Act even if inconsistent with the Convention (s3 HRA).
Similarly, because the UK remains a strong international trading nation, international consistency of decision-making is of vital importance, so Admiralty law is strongly influenced by public international law and the modern commercial treaties and conventions regulating shipping.
Overseas influences
English law has significant antiquity. The oldest law currently in force is the Distress Act 1267, part of the Statute of Marlborough (52 Hen. 3). Three sections of Magna Carta, originally signed in 1215 and a landmark in the development of English law, are still extant, but they date to the reissuing of the law in 1297.
Statute
Subjects and links
English Criminal Code
Intention in English law
Causation in English law
Manslaughter in English law
Murder in English law
Theft in English law
Robbery in English law
Necessity in English law
Provocation in English law
Duress in English law
Self-defence in English law
Diminished responsibility in English law
Criminal law
Fundamental laws of England
Family law
Main article: English tort law
Tort
Consideration under English law
Estoppel (English law)
Measure of Damages (under English law)
Contract
Chose (English law)
Evidence
Licensing law in England and Wales
Residence in English law
Thursday, August 23, 2007
Florida Atlantic University
Florida Atlantic University, also commonly referred to as FAU or Florida Atlantic, is a public coeducational research university located in Boca Raton, Florida, USA. The university has six additional partner campuses located in the Florida cities of Dania Beach, Davie, Fort Lauderdale, Jupiter, Port St. Lucie, and Fort Pierce at the Harbor Branch Oceanographic Institution.
History
Academics
Florida Atlantic University's student body consists of 22,181 undergraduates and 3,476 graduate and professional students. The undergraduate student body, of which 42% are ethnic minorities, comes from 144 countries, 48 states and the District of Columbia. For the undergraduate class of 2010, the acceptance rate was 58%.
Profile
FAU is currently classified as a Research University with high research activity by The Carnegie Foundation for the Advancement of Teaching.
Research
Florida Atlantic has been ranked among American universities by a number of publications throughout its history. In 2007 FAU was classified a 4th tier university by the U.S. News & World Report's rankings of "Best Colleges".
Rankings
Florida Atlantic University is a distributed university located on seven campuses spread across three counties of Florida's eastern coastline.
Campus
Palm Beach County Campuses
See also: Florida Atlantic University Stadium
Florida Atlantic's main campus in Boca Raton was established in 1964 on the remnants of a World War II American Army airbase. Spanning 850 acres (3.5 km²) near the Atlantic Ocean, the site lies between the cities of Palm Beach and Fort Lauderdale. The campus was designated a burrowing owl sanctuary by the Audubon Society in 1971.
The Boca campus also houses a number of other programs including the A.D. Henderson University School, FAU High School, one of two FAU Research Parks, and the Lifelong Learning Society.
Boca Raton
In addition to the Boca campus in southern Palm Beach County, Florida Atlantic operates a campus in northern Palm Beach County in Jupiter. The John D. MacArthur Campus, named after businessman and philanthropist John D. MacArthur, was established in 1999 to serve residents of central and northern Palm Beach and southern Martin Counties. "The campus currently occupies approximately 45 acres (.18 km²) with 18 buildings totaling more than 333,000 square feet: eight classroom/office buildings, a library, a 500-seat auditorium, two residence halls, a dining hall, museum building and central utility plant."
Broward County Campuses
The Dania Beach Campus, also known as SeaTech, was founded in 1997. "A state-funded Type II research center, the institute is part of Florida Atlantic's Department of Ocean Engineering."
Dania Beach - SeaTech
The Davie Campus of Florida Atlantic University was established in 1990 on 38 acres (.15 km²) of land in western Broward County.
Davie
The Fort Lauderdale Campus, located in the heart of downtown Fort Lauderdale, "provides a laboratory for students in business, computer arts, architecture, urban and regional planning, criminal justice, social work, and public administration."
Fort Lauderdale
St. Lucie County Campuses
Located in Port St. Lucie, Florida the Treasure Coast Campus of Florida Atlantic University operates under a unique partnership with Indian River Community College (IRCC). Since the 1970s FAU has been operating on the Treasure Coast in conjunction with IRCC "to extend educational opportunities that take students from an associate's degree to undergraduate and graduate degrees."
Port St. Lucie - Treasure Coast Campus
In addition to the Treasure Coast Campus, FAU operates a campus jointly in Fort Pierce with the Harbor Branch Oceanographic Institution (HBOI). While this partnership began with informal research ties more than a decade ago, in recent years the partnership has solidified with the construction of an FAU research and teaching facility on Harbor Branch's 600 acre (2.4 km²) campus. This facility was constructed with $11 million in state appropriations.
Fort Pierce - HBOI
Main article: Florida Atlantic Owls
Athletics
Since the inception of the athletics program, a number of sports-related traditions and school spirit organizations have been started at the university. One new tradition is "Bury the Burrow in Red," which calls for Florida Atlantic students to wear as much red as possible and fill the Burrow, the university's multi-purpose arena, during the annual basketball face-off between FAU and nearby neighbor Florida International University (FIU).
Traditions
Student Life
Residential housing at Florida Atlantic University is available on the Boca Raton and John D. MacArthur campuses. "All full-time freshman students are required to live in university housing," however, "exceptions are made for a number of reasons including residing with a parent or legal guardian within a 50-mile commutable distance from the campus, a student being 21 years of age, or if a student is married."
Residential life
For the 2006-2007 academic year, Florida Atlantic had approximately 150 registered student organizations. This includes 40 academic organizations, 19 honor societies, 18 spiritual/religious organizations, 16 diversity appreciation organizations, 5 service organizations, 25 personal interest organizations, 12 club-sports, and 7 student government agencies. These clubs and organizations run the gamut from sailing to Ultimate Frisbee, from varsity and club sports and a jazz group to a pottery guild, from political organizations to chess and video game clubs.
Greek life
Main article: List of Florida Atlantic University people
Wednesday, August 22, 2007
Yeongdeungpo-gu is an administrative district in southwest Seoul, Korea. Although the origin of the name is uncertain, the first two syllables are thought to be from "yeongdeung" (靈登) or "divine ascent", a shamanic rite. The third syllable is "po", representing water (浦), referring to the district's position on the Han River. The 2006 population was 408,819. The current magistrate is Kim Hyung-Su.
There are 22 administrative "dong" and 34 legal "dong". Yeouido Dong is the largest in area and takes about 34% of the land. The total area is 24.56 km² (2004), making up 4% of Seoul's land. The annual budget is approximately 2 billion won.
Yeongdeungpo Gu has been heavily developed as an office, commercial, and residential district. Yeouido Dong is home to the famous DLI 63 Building, the tallest office building in South Korea and currently the third-tallest building in the country.
The National Assembly Building is located in Yeouido-dong, in Yeongdeungpo Gu.
Notes
List of Korea-related topics
Geography of South Korea
Subdivisions of South Korea
Administrative divisions of Seoul
Tuesday, August 21, 2007
Dennis John Kucinich (born October 8, 1946) is an American politician of the Democratic Party and a candidate for President of the United States in both 2004 and 2008.
Kucinich currently represents the 10th District of Ohio in the United States House of Representatives. His district includes most of western Cleveland, as well as such suburbs as Parma and Cuyahoga Heights. He is currently the chairman of the Domestic Policy Subcommittee of the House Committee on Oversight and Government Reform.
From 1977 to 1979, Kucinich served as the 53rd mayor of Cleveland, Ohio, a tumultuous term in which he survived a recall election and was successful in a battle against selling the municipal electric utility before being defeated for reelection by George V. Voinovich in 1979.
Personal details
Kucinich's political career began early. After running unsuccessfully in 1967, Kucinich was elected to the Cleveland City Council in 1969, when he was 23.
Early career
Main article: Mayoral administration of Dennis Kucinich
Cleveland mayoralty, 1977–1979
After losing his re-election bid for Mayor to George Voinovich in 1979, Kucinich kept a low-profile in Cleveland politics. He criticized a tax referendum proposed by Voinovich in 1980, which voters eventually approved. He also struggled to find employment and moved to Los Angeles, California where he stayed with a friend, actress Shirley MacLaine. "He was in political Siberia in the 1980s," said Joseph Tegreene years later. "It was only when it became clear to people that he was right... he got belated recognition for the things that he did."
Post-mayorship
In 1996, Kucinich was elected to the U.S. House of Representatives, representing the 10th district of Ohio. He defeated two-term Republican incumbent Martin Hoke in what is still regarded as an upset given the 10th's historic Republican lean; however, he has not faced serious opposition since.
He serves on the Congressional Education and Labor Committee as well as the Government Reform Committee. He is a member of the Congressional Progressive Caucus and is a self-described "Wellstone Democrat."
House of Representatives
Main article: Political positions of Dennis Kucinich
Domestic policy voting record
Kucinich has criticized the foreign policy of President Bush, including the 2003 invasion of Iraq and what Kucinich perceives to be building American hostility towards Iran. Kucinich and Ron Paul are the only candidates who voted against the 2003 invasion of Iraq. He has since voted against funding it 100% of the time. In 2005, Kucinich voted against the Iran Freedom and Support Act, calling it a "stepping stone to war."
2004 presidential campaign
On December 10, 2003, the American Broadcasting Company (ABC) announced the removal of its correspondents from the campaigns of Kucinich, Carol Moseley Braun and Al Sharpton.
Press coverage
In the 2004 Democratic presidential nomination race, national polls consistently showed Kucinich's support in single digits, but rising, especially as Howard Dean lost some support among peace activists for refusing to commit to cutting the Pentagon budget. Though he was not viewed as a viable contender by most, there were differing polls on Kucinich's popularity.
He placed second in MoveOn.org's primary, behind Dean. He also placed first in other polls, particularly Internet-based ones. This led many activists to believe that his showing in the primaries might be better than what Gallup polls had been saying. However, in the non-binding Washington, D.C. primary, Kucinich finished fourth (last out of candidates listed on the ballot), with only eight percent of the vote. Support for Kucinich was most prevalent in the caucuses around the country.
In the Iowa caucuses he finished fifth, receiving about one percent of the state delegates from Iowa; far below the 15% threshold for receiving national delegates. He performed similarly in the New Hampshire primary, placing sixth among the seven candidates with 1% of the vote. In the Mini-Tuesday primaries Kucinich finished near the bottom in most states, with his best performance in New Mexico where he received less than six percent of the vote, and still no delegates. Kucinich's best showing in any Democratic contest was in the February 24 Hawaii caucus, in which he won 31% of caucus participants, coming in second place to Senator John Kerry of Massachusetts. He also saw a double-digit showing in Maine on February 8, where he got 16% in that state's caucus.
On Super Tuesday, March 2, Kucinich gained another strong showing with the Minnesota caucus, where 17% of the ballots went to him. In his home state of Ohio, he gained nine percent in the primary.
Kucinich campaigned heavily in Oregon, spending thirty days there during the two months leading up to the state's May 18 primary. He continued his campaign because "the future direction of the Democratic Party has not yet been determined." He won 16% of the vote.
Even after Kerry won enough delegates to secure the nomination, Kucinich continued to campaign up until just before the convention, citing an effort to help shape the agenda of the Democratic party. He was the last candidate to end his campaign, mere days before the start of the convention.
Polls and primaries
Main article: Political positions of Dennis Kucinich
2008 Presidential campaign
On January 8, 2007, Dennis Kucinich unveiled his comprehensive exit plan to bring the troops home and stabilize Iraq:
Announce that the US will end the occupation, close the military bases, and withdraw.
Announce that existing funds will be used to bring the troops and the necessary equipment home.
Order a simultaneous return of all U.S. contractors to the United States and turn over the contracting work to the Iraqi government
Convene a regional conference for the purpose of developing a security and stabilization force for Iraq.
Prepare an international security peacekeeping force to move in, replacing U.S. troops, who then return home.
Develop and fund a process of national reconciliation.
Restart programs for reconstruction and creating jobs for the Iraqi people.
Provide reparations for the damage that has been done to the lives of Iraqis.
Assure the political sovereignty of Iraq and ensure that its oil isn't stolen.
Repair the Iraqi economy.
Guarantee economic sovereignty for Iraq
Commence an international truth and reconciliation process, which establishes a policy of truth and reconciliation between the people of the United States and Iraq.
The Kucinich Plan For Iraq
Kucinich has always been easily reelected to Congress, though Republicans and conservative Democrats have made increasingly high-profile attempts to challenge him. In the 2004 primary election, Kucinich was renominated for the seat representing Ohio's 10th congressional district.
Democratic party primary election results:
In the general election, the result was:
Kucinich defeated Republican candidate Ed Herman. Because of Kucinich's national fame, both candidates received much backing by their parties from outside the district, particularly on the Internet.
In 2006, Kucinich defeated another Democratic primary challenger by a wide margin, and defeated Republican Mike Dovilla in the general election with 66% of the vote, despite last-minute Republican attempts to bring more support to Dovilla.
Congressional campaigns
In 2003, Kucinich was the recipient of the Gandhi Peace Award, an annual award bestowed by the Religious Society of Friends-affiliated organization Promoting Enduring Peace.
Recognition
Kucinich introduced the first Space Preservation Act on October 2, 2001, with no cosponsors. The bill was referred to the House Science, the House Armed Services, and the House International Relations committees. The bill died in committee (April 9, 2002) because of an unfavorable executive comment received from the Department of Defense.
Space Preservation Act of 2001
On April 17, 2007, Kucinich sent a letter to his Democratic colleagues saying that he planned to file impeachment proceedings against Dick Cheney, the vice president of the United States, without specifying the charges to be brought. Six of the Democrats who later backed the effort are members of the House Judiciary Committee: Tammy Baldwin, Keith Ellison, Hank Johnson, Maxine Waters, Steve Cohen and Sheila Jackson-Lee.
Impeachment proceedings against Dick Cheney
Kucinich has been a vocal opponent of the H-1B and L-1 visa programs. In an article on his campaign website, he states:
The expanded use of H-1B and L-1 visas has had a negative effect on the workplace of Information Technology workers in America. It has caused a reduction in wages. It has forced workers to accept deteriorating working conditions and allowed U.S. companies to concentrate work in technical and geographic areas that American workers consider undesirable. It has also reduced the number of IT jobs held by Americans.
Opposition to H1B/L1 Visa Programs
In the aftermath of the Virginia Tech massacre in Blacksburg, Virginia, Kucinich proposed a plan that he says will address violence in America. Kucinich is currently drafting legislation that includes a ban on the purchase, sale, transfer, or possession of handguns by civilians.
Plan to ban handguns
Kucinich is also involved in efforts to bring back the Fairness Doctrine, requiring radio stations to give liberal and conservative points of view equal time, which he and other critics of talk radio claim is not presently the case. He is joined in this effort by fellow Democrat Maurice Hinchey, among others, as well as independent Senator Bernie Sanders.
Animal rights
Comparison of 2008 presidential candidates
Sunday, August 19, 2007
A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
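As a rough illustration of the seed-and-frontier cycle just described, here is a minimal breadth-first crawler sketch in Python using only the standard library. The page limit, the timeout and the example seed URL are assumptions added for the sketch, not features of any particular crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    """Breadth-first crawl: visit seed URLs, push newly discovered URLs onto the frontier."""
    frontier = deque(seeds)          # the "crawl frontier"
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # skip unreachable or malformed pages
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)        # resolve relative links against the page URL
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return visited

# Example with a hypothetical seed: crawl(["http://example.com/"])
```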
Crawling policies
Given the current size of the Web, even large search engines cover only a portion of the publicly available internet; a study by Lawrence and Giles (Lawrence and Giles, 2000) showed that no search engine indexes more than 16% of the Web. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web.
This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.
Cho et al. (Cho et al., 1998) made the first study on policies for crawl scheduling. Their data set was a 180,000-page crawl from the stanford.edu domain, on which a crawling simulation was run with different strategies. The ordering metrics tested were breadth-first, backlink count and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high PageRank early in the crawling process, then the partial PageRank strategy is best, followed by breadth-first and backlink count. However, these results are for just a single domain.
Najork and Wiener (Najork and Wiener, 2001) performed an actual crawl on 328 million pages, using breadth-first ordering. They found that a breadth-first crawl captures pages with high Pagerank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates".
Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to. It is similar to a PageRank computation, but it is faster and is only done in one step. An OPIC-driven crawler downloads first the pages in the crawling frontier with the higher amounts of "cash". Experiments were carried out on a 100,000-page synthetic graph with a power-law distribution of in-links. However, there was no comparison with other strategies, nor any experiments on the real Web.
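The cash-distribution idea can be sketched in a few lines of Python. This is a simplified illustration of an OPIC-style ordering, not the authors' implementation; the toy link graph in the comment is hypothetical.

```python
def opic_crawl_order(graph, seeds, steps=10):
    """graph: dict mapping page -> list of out-linked pages.
    Returns pages in the order an OPIC-style crawler would fetch them."""
    cash = {page: 1.0 for page in seeds}       # initial cash sits on the seeds
    frontier = set(seeds)
    order = []
    for _ in range(steps):
        if not frontier:
            break
        # Fetch the frontier page holding the most accumulated cash.
        page = max(frontier, key=lambda p: cash.get(p, 0.0))
        frontier.remove(page)
        order.append(page)
        outlinks = graph.get(page, [])
        if outlinks:
            share = cash.get(page, 0.0) / len(outlinks)
            for target in outlinks:
                cash[target] = cash.get(target, 0.0) + share   # distribute cash equally
                if target not in order:
                    frontier.add(target)
        cash[page] = 0.0                        # the fetched page's cash has been spent
    return order

# Tiny hypothetical link graph:
# print(opic_crawl_order({"a": ["b", "c"], "b": ["c"], "c": []}, ["a"]))  # ['a', 'b', 'c'] or ['a', 'c', 'b']
```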
Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. The comparison was based on how well the PageRank computed on a partial crawl approximates the true PageRank value. Surprisingly, some visits that accumulate PageRank very quickly (most notably, breadth-first and the omniscient visit) provide very poor progressive approximations.
Baeza-Yates et al. (Baeza-Yates et al., 2005) used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.
Selection policy
A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm or a slash. This strategy may cause numerous HTML Web resources to be unintentionally skipped. A similar strategy compares the extension of the web resource to a list of known HTML-page types: .html, .htm, .asp, .aspx, .php, and a slash.
Some crawlers may also avoid requesting any resources that have a "?" in them (are dynamically produced) in order to avoid spider traps which may cause the crawler to download an infinite number of URLs from a Web site.
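The selection heuristics just described can be combined as in the following sketch: check the URL extension first, skip dynamically produced URLs containing "?", and fall back to an HTTP HEAD request to confirm the MIME type before issuing a full GET. The function name and the fixed extension list are illustrative choices, not a standard.

```python
# Sketch of the selection heuristics described above: check the extension,
# skip dynamic URLs containing "?", and fall back to an HTTP HEAD request to
# confirm the MIME type before issuing a full GET.
from urllib.parse import urlparse
from urllib.request import Request, urlopen

HTML_EXTENSIONS = (".html", ".htm", ".asp", ".aspx", ".php", "/")

def probably_html(url):
    if "?" in url:
        return False                      # avoid dynamically produced URLs (spider traps)
    path = urlparse(url).path or "/"
    if path.lower().endswith(HTML_EXTENSIONS):
        return True                       # looks like a known HTML-page type
    # Otherwise ask the server: a HEAD request returns headers only.
    try:
        response = urlopen(Request(url, method="HEAD"))
        return response.headers.get_content_type() == "text/html"
    except Exception:
        return False

# e.g. probably_html("http://example.com/index.html") -> True without any request
```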
Restricting followed links
Some crawlers intend to download as many resources as possible from a particular Web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.
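For the example URL above, a path-ascending crawler can derive the ancestor paths mechanically, as in this small sketch:

```python
# Sketch of path-ascending: from one seed URL, also enqueue every ancestor path.
from urllib.parse import urlparse, urlunparse

def ascending_paths(url):
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    urls = []
    # Drop the last segment repeatedly: /hamster/monkey/page.html -> /hamster/monkey/ -> /hamster/ -> /
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth]) + ("/" if depth else "")
        urls.append(urlunparse((parts.scheme, parts.netloc, path, "", "", "")))
    return urls

print(ascending_paths("http://llama.org/hamster/monkey/page.html"))
# ['http://llama.org/hamster/monkey/', 'http://llama.org/hamster/', 'http://llama.org/']
```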
Many path-ascending crawlers are also known as harvester software, because they are used to "harvest" or collect all the content - perhaps the collection of photos in a gallery - from a specific page or host.
Path-ascending crawling
Main article: Focused crawler
Focused crawling
A vast number of Web pages lie in the deep or invisible Web. These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find these pages if there are no links that point to them. Google's Sitemap Protocol and mod_oai (Nelson et al., 2005) are intended to allow discovery of these deep-Web resources.
Crawling the Deep Web
The Web has a very dynamic nature, and crawling a fraction of the Web can take a long time, usually measured in weeks or months. By the time a Web crawler has finished its crawl, many events may have happened, including page creations, updates and deletions.
From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.
Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined below.
Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined below.
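Following (Cho and Garcia-Molina, 2000), the two measures are usually written as the following piecewise formulas:

```latex
% Freshness of page p in the repository at time t
F_p(t) =
\begin{cases}
1 & \text{if the local copy of } p \text{ is equal to the live page at time } t \\
0 & \text{otherwise}
\end{cases}

% Age of page p in the repository at time t
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ has not been modified since it was last downloaded} \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```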
Coffman et al. (Edward G. Coffman, 1998) worked with a definition of the objective of a Web crawler that is equivalent to freshness, but using different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system, in which the Web crawler is the server and the Web sites are the queues. Page modifications are the arrivals of customers, and switch-over times are the intervals between page accesses to a single Web site. Under this model, the mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler.
The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are out-dated, while in the second case, the crawler is concerned with how old the local copies of pages are.
Two simple re-visiting policies were studied by Cho and Garcia-Molina (Cho and Garcia-Molina, 2003):
Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.
(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)
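As a concrete (and entirely invented) numerical illustration of the two policies, the sketch below allocates a fixed daily re-visit budget across three pages with very different estimated change rates:

```python
# Toy illustration of the two re-visit policies above. change_rates maps each
# page to an estimated number of changes per day (invented numbers); the crawler
# has a fixed budget of total re-visits per day to allocate.
def uniform_policy(change_rates, budget_per_day):
    n = len(change_rates)
    return {page: budget_per_day / n for page in change_rates}

def proportional_policy(change_rates, budget_per_day):
    total = sum(change_rates.values())
    return {page: budget_per_day * rate / total for page, rate in change_rates.items()}

change_rates = {"news.html": 24.0, "blog.html": 1.0, "about.html": 0.01}
print(uniform_policy(change_rates, 30))       # every page re-visited 10 times per day
print(proportional_policy(change_rates, 30))  # almost the whole budget goes to news.html
```

Note how the proportional policy spends nearly the whole budget on the fastest-changing page; as discussed next, Cho and Garcia-Molina found this to be counter-productive for average freshness.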
Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result comes from the fact that, when a page changes too often, the crawler will waste time by trying to re-crawl it too fast and still will not be able to keep its copy of the page fresh.
To improve freshness, the crawler should penalize the elements that change too often (Cho and Garcia-Molina, 2003a). The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimum is closer to the uniform policy than to the proportional policy: as Coffman et al. (Edward G. Coffman, 1998) note, "in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible". Explicit formulas for the re-visit policy are not attainable in general, but they can be obtained numerically, as they depend on the distribution of page changes. (Cho and Garcia-Molina, 2003a) show that the exponential distribution is a good fit for describing page changes, while (Ipeirotis et al., 2005) show how to use statistical tools to discover the parameters that affect this distribution. Note that the re-visiting policies considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), which is not a realistic scenario, so further information about Web page quality should be included to achieve a better crawling policy.
Re-visit policy
Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. If a single crawler performs multiple requests per second and/or downloads large files, a server can have a hard time keeping up with requests, let alone with requests from multiple crawlers.
As noted by Koster (Koster, 1995), the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community; the main costs of using Web crawlers are itemized in the list at the end of this section.
A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol (Koster, 1996), a standard that lets administrators indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines such as Ask Jeeves, MSN and Yahoo have begun honoring an extra "Crawl-delay:" parameter in the robots.txt file that indicates the number of seconds to delay between requests.
The first proposed interval between connections (Koster, 1993) was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than two months to download that entire website, and only a fraction of the resources of that Web server would be used. This does not seem acceptable.
Cho (Cho and Garcia-Molina, 2003) uses 10 seconds as an interval for accesses, and the WIRE crawler (Baeza-Yates and Castillo, 2002) uses 15 seconds as the default. The MercatorWeb crawler (Heydon and Najork, 1999) follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. Dill et al. (Dill et al., 2002) use 1 second.
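A minimal sketch of these politeness mechanisms in Python might look as follows, combining robots.txt exclusion, an advertised Crawl-delay and a MercatorWeb-style adaptive 10t wait. The user-agent string and crawler URL are placeholders, and this is a sketch rather than a robust implementation (Python 3.6+ is assumed for crawl_delay).

```python
# A minimal polite fetcher: honor robots.txt (including the optional
# "Crawl-delay:" directive) and apply an adaptive 10*t wait per host.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENT = "ExampleCrawler/0.1 (+http://crawler.example.com/about)"  # placeholder identity

class PoliteFetcher:
    def __init__(self):
        self.robots = {}       # host -> parsed robots.txt rules
        self.next_delay = {}   # host -> seconds to wait before the next request

    def _rules(self, host):
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
            rp.read()
            self.robots[host] = rp
        return self.robots[host]

    def fetch(self, url):
        parts = urlparse(url)
        host = parts.scheme + "://" + parts.netloc
        rules = self._rules(host)
        if not rules.can_fetch(USER_AGENT, url):
            return None  # excluded by the robots exclusion protocol
        # Respect an advertised Crawl-delay or the adaptive delay, whichever is longer.
        advertised = rules.crawl_delay(USER_AGENT) or 0
        time.sleep(max(advertised, self.next_delay.get(host, 0)))
        start = time.time()
        body = urlopen(Request(url, headers={"User-Agent": USER_AGENT})).read()
        self.next_delay[host] = 10 * (time.time() - start)  # MercatorWeb-style 10*t rule
        return body
```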
Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading Web servers, some complaints from Web server administrators are received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen." (Brin and Page, 1998).
These costs include:
Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time.
Server overload, especially if the frequency of accesses to a given server is too high.
Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
Politeness policy
Main article: Distributed web crawling
Parallelization policy
A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture.
Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."
Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.
URL normalization
Web crawlers typically identify themselves to a Web server by using the User-agent field of an HTTP request. Web site administrators typically examine their Web server's log and use the user agent field to determine which crawlers have visited the Web server and how often. The user agent field may include a URL where the Web site administrator can find out more information about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or another well-known crawler.
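On the administrator's side, a quick way to see which crawlers visit and how often is to count requests per user agent in the access log. The sketch below assumes the common Apache/Nginx "combined" log format, in which the user agent is the last quoted field; the log path is a placeholder.

```python
# Sketch of the administrator's side: count requests per User-agent in a
# combined-format access log to see which crawlers visit and how often.
import re
from collections import Counter

UA_RE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"$')  # last quoted field = user agent

def crawler_hits(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = UA_RE.search(line.strip())
            if match:
                counts[match.group(1)] += 1
    return counts.most_common()

# e.g. crawler_hits("/var/log/apache2/access.log")
```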
It is important for Web crawlers to identify themselves so Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they may be overloading a Web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators that are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.
Crawler identification
The following is a list of published crawler architectures for general-purpose crawlers (excluding focused Web crawlers), with a brief description that includes the names given to the different components and outstanding features:
RBSE (Eichmann, 1994) was the first published Web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web.
WebCrawler (Pinkerton, 1994) was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
World Wide Web Worm (McBryan, 1994) was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the grep Unix command.
Google Crawler (Brin and Page, 1998) is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to the URL server, which checked whether each URL had been seen before; if not, the URL was added to the URL server's queue.
CobWeb (da Silva et al., 1999) uses a central "scheduler" and a series of distributed "collectors". The collectors parse the downloaded Web pages and send the discovered URLs to the scheduler, which in turn assigns them to the collectors. The scheduler enforces a breadth-first search order with a politeness policy to avoid overloading Web servers. The crawler is written in Perl.
Mercator (Heydon and Najork, 1999; Najork and Heydon, 2001) is a distributed, modular Web crawler written in Java. Its modularity arises from the use of interchangeable "protocol modules" and "processing modules". Protocol modules are related to how the Web pages are acquired (e.g. by HTTP), and processing modules are related to how the Web pages are processed. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages, or to gather statistics from the Web.
WebFountain (Edwards et al., 2001) is a distributed, modular crawler similar to Mercator but written in C++. It features a "controller" machine that coordinates a series of "ant" machines. After repeatedly downloading pages, a change rate is inferred for each page, and a non-linear programming method must be used to solve the equation system that maximizes freshness. The authors recommend using this crawling order in the early stages of the crawl, and then switching to a uniform crawling order, in which all pages are visited with the same frequency.
PolyBot [Shkapenyuk and Suel, 2002] is a distributed crawler written in C++ and Python, which is composed of a "crawl manager", one or more "downloaders" and one or more "DNS resolvers". Collected URLs are added to a queue on disk, and processed later to search for seen URLs in batch mode. The politeness policy considers both third and second level domains (e.g.: www.example.com and www2.example.com are third level domains) because third level domains are usually hosted by the same Web server.
WebRACE (Zeinalipour-Yazti and Dikaiakos, 2002) is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading Web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for "subscriptions" to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to crawl from.
UbiCrawler (Boldi et al., 2004) is a distributed crawler written in Java, and it has no central process. It is composed of a number of identical "agents", and the assignment function is calculated using consistent hashing of the host names (see the sketch after this list). There is zero overlap, meaning that no page is crawled twice, unless a crawling agent crashes (then another agent must re-crawl the pages of the failing agent). The crawler is designed to achieve high scalability and to be tolerant of failures.
FAST Crawler (Risvik and Michelsen, 2002) is the crawler used by the FAST search engine, and a general description of its architecture is available. It is a distributed architecture in which each machine holds a "document scheduler" that maintains a queue of documents to be downloaded by a "document processor" that stores them in a local storage subsystem. Each crawler communicates with the other crawlers via a "distributor" module that exchanges hyperlink information.
Labrador is a closed-source Web crawler that works with the open-source Terrier search engine.
Spinn3r is a crawler used to build Tailrank. Spinn3r is based on Java and the majority of its architecture is open source. Spinn3r is mostly oriented around crawling the blogosphere.
HotCrawler is a crawler written in C and PHP. It crawls websites by visiting a list of URLs stored in its database, adds new URLs to its queue as it finds them, and is separate from the search engine itself. If a URL has already been crawled during the current queue session, it is added to the most recently created queue session. HotCrawler essentially consists of two separate programs: one that downloads pages and saves copies of them in a database, and another that determines the next time to visit each page, based on many factors.
In addition to the specific crawler architectures listed above, there are general crawler architectures published by Cho (Cho and Garcia-Molina, 2002) and Chakrabarti (Chakrabarti, 2003).
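As a toy sketch of the consistent-hashing assignment mentioned in the UbiCrawler entry above: each host name (and several virtual points per agent) is hashed onto a ring, and a host is assigned to the first agent at or after its position, so adding or removing an agent only reassigns a small fraction of hosts. The agent names are invented.

```python
# Toy consistent-hashing assignment of host names to crawling agents, in the
# spirit of UbiCrawler's decentralized design (agent names are invented).
import bisect
import hashlib

def _ring_position(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HostAssigner:
    def __init__(self, agents, replicas=100):
        # Place several virtual points per agent on the ring for better balance.
        self.ring = sorted((_ring_position(f"{agent}#{i}"), agent)
                           for agent in agents for i in range(replicas))
        self.positions = [pos for pos, _ in self.ring]

    def agent_for(self, host):
        index = bisect.bisect(self.positions, _ring_position(host)) % len(self.ring)
        return self.ring[index][1]

assigner = HostAssigner(["agent-1", "agent-2", "agent-3"])
print(assigner.agent_for("www.example.com"))  # deterministic: same host -> same agent
```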
Open-source crawlers
Distributed web crawling
Focused crawler
Internet Archive
Library of Congress Digital Library project
National Digital Information Infrastructure and Preservation Program
PageRank
Spambot
Spider trap
Spidering Hacks - an O'Reilly book focused on spider-like programming
Search Engine Indexing - the step after crawling
Web archiving