Archive for October, 2009

Web Hosting Martinsburg Wv

Thursday, October 15th, 2009
class c ip
Shawn Burgy asked:


Web Hosting Martinsburg WV :

Here are some possibilities you may want to consider when selecting a discounted Web hosting company.

There are three types of hosting plans available:

- Shared hosting
- Virtual Private Server hosting
- Dedicated Server hosting

What Is A Web Host?

Essentially a web host is somebody who provides a place to store the files (pages) of your web site and makes them visible to the internet at large.

People can view these pages using a web browser, read the content, download files and generally interact in all the ways you have probably done on the internet up until now.

When you purchase a web hosting account you are literally buying disk space on the hosting company’s computer in which you store your web site.

Your site will consist of pages containing text, images or graphics that you use on those pages and any files that you might make available for download such as videos or ebooks.

You control which pages are visible and what content is on them by creating these pages in an HTML editor. This is special software that creates web pages and is, to all intents and purposes, nothing more than a word processor with extra functionality built in.

With a shared hosting plan, several web sites are hosted on the same server, sharing the server’s resources and using the same IP address. Virtual Private Server (VPS) plans consist of a server that is split into multiple virtual servers, each with its own IP address; some companies call these plans Virtual Dedicated Servers. Dedicated servers are the most expensive type of plan: each customer gets their own physical server, which is nice to have but prohibitively expensive for personal web sites and small operations.

Features Of Web Hosting For The Beginners

If you are just starting out on the Internet and want a good website to display your skills as a writer or a marketer, you can use web hosting aimed at beginners. Its features are deliberately basic so that any newcomer can get accustomed to them quickly. As time passes and the newcomer becomes skilled in using these features, he can move on to the intermediate and advanced features and create a better website. This article gives you an idea of how to use the basic features that a web hosting service provides for novices.

However, before you learn about web hosting and web development in detail, you need to know about some very important features of the Internet, such as the World Wide Web. The World Wide Web is a system of interlinked documents hosted on a huge number of computers around the world, which communicate with each other using protocols like HTTP. HTTP (Hypertext Transfer Protocol) is the protocol that governs the transmission of documents on the Web.

Why Is Web Hosting Important?

The web hosting server allows you to host your website and make it available on the Internet for the whole world to access. This way you can advertise your services and products on your website. Other web services, such as e-mail capability, database capability, and support for dynamic content, are essential to really tap the power of websites.

The e-mail capability allows you to receive e-mail and to send e-mails and information to your subscribers directly from your website. The database capability allows you to store a large amount of useful information on your website. Dynamic content is content that allows you and your visitors to interact with each other.

VPS hosting plans tend to be somewhat more expensive than shared hosting plans, but it is our belief that they are worth the extra cost since they provide much more control and flexibility. If you are a Java developer, chances are you are used to “getting your hands dirty” and working on a server using good old Unix commands. Shared hosting plans tend to have “user friendly” (dumbed down?) interfaces, which might simplify administration but can also severely limit what you are able to do. For example, let’s say a shared hosting company gives you 300 megabytes of disk space to host your web site and an additional 300 megabytes for your email. If your web site takes 5 megabytes of space but your email storage is getting full, there is no way to reduce the web space allocation and give the freed space to email. In addition to leaving you unable to reallocate resources as needed, you can also forget about installing any applications on your server. Another disadvantage of shared hosting plans is that an IP address is shared among several customers, which can cause problems. For example, if one of the customers uses the mail server for bulk emailing, the IP address of that mail server may be banned by several systems; in a shared hosting environment, this affects all the customers using the same server.

With few exceptions, shared hosting plans that support Java do so through a shared JVM, which means that you have no way of starting or stopping the JVM, and the same JVM is used to run the Java applications of all the hosting company’s clients on the server. With a VPS plan, since you have access to your own (virtual) server, it is a given that you get full control over the JVM.

For all of these reasons I recommend the Web Hosting Providers in my links.

Price, value, and most of all, customer service.




Lewis

Search Engines vs. SEO Spam: Statistical Methods

Sunday, October 11th, 2009
seo hosting
Oleg Ishenko asked:


High placement in a search engine is critical for the success of any online business. Pages appearing higher in the search engine results for queries relevant to a site’s business will get more targeted traffic. To get this kind of competitive advantage, Internet companies employ various SEO techniques in order to optimize certain factors used by search engines to rank results.

In the best case SEO specialists create relevant, well-structured, keyword-rich pages, which not only please the eyes of a search engine crawler but also have value to the human visitor. Unfortunately it takes months for this strategic approach to produce tangible results, so many search engine optimizers turn to so-called “black-hat” SEO.

‘Black Hat’ SEO and Search Engine Spam

The oldest and simplest “black-hat” SEO strategy is adding a variety of popular keywords to web pages to make them rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, “black-hat” SEO went one step further, creating so-called “doorway” pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic.

Another trend is abusing link-popularity-based ranking algorithms, such as PageRank, with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages are able to produce a sizeable PageRank for the target page. Search engines constantly improve their algorithms, trying to minimize the effect of “black-hat” SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so that this process resembles an arms race.
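To make the link-farm effect concrete, here is a minimal sketch of the textbook PageRank iteration. It is an illustration only, not the algorithm of any actual search engine: the damping factor of 0.85, the iteration count, and the page names `farm0`…`farm49`, `target` and `ordinary` are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is redistributed evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link farm: 50 generated pages that all endorse one target page.
links = {f"farm{i}": ["target"] for i in range(50)}
links["ordinary"] = []  # a page nobody links to, for comparison

rank = pagerank(links)
print(rank["target"] / rank["ordinary"])
```

In this toy graph every farm page holds only the minimum rank, yet the target ends up with a rank dozens of times higher than the page that nobody links to.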

“Black-hat” SEO is responsible for the immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because those pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.

Using Statistics to Detect Search Engine Spam

An example of an application of statistical methods to detect web spam is presented in the paper “Spam, Damn Spam, and Statistics” by Dennis Fetterly, Mark Manasse and Marc Najork from Microsoft [1]. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a confidence interval of ±1.95% at 95% confidence.

Another set was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links and, for the HTTP redirects, the source and the target URL. 535 pages were manually inspected and 37 of them were identified as spam (6.9%).
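The sample figures for Set 1 can be checked with a few lines of arithmetic. The sketch below assumes the usual normal approximation for a proportion with z = 1.96 for 95% confidence; the text does not say which method the researchers used, so treat this as one plausible reading. The function name is made up for the example.

```python
import math


def proportion_with_margin(hits, sample_size, z=1.96):
    """Return (proportion, margin of error) using the normal approximation."""
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# 61 spam pages found in a manually inspected sample of 751 pages.
p, margin = proportion_with_margin(61, 751)
print(f"{p:.1%} +/- {margin:.2%}")  # prints "8.1% +/- 1.95%"
```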

The research concentrates on studying the following properties of web pages:

- URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
- Host name resolutions.
- Linkage properties.
- Content properties.
- Content evolution properties.
- Clustering properties.

URL Properties

Search engine optimizers often use numerous automatically generated pages to funnel their individually low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumptions are that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.

Manual inspection of the 100 longest host names revealed that 80 of them belonged to adult sites and 11 referred to financial and credit-related sites. Therefore, in order to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
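The rule above is simple enough to sketch directly. The thresholds come from the text; applying them to the host component only is my reading of the earlier remark about host names, and the function name and example URLs are invented for illustration.

```python
from urllib.parse import urlsplit


def looks_machine_generated(url, min_length=45, min_dots=6,
                            min_dashes=5, min_digits=10):
    """Flag a URL whose host name is long and rich in dots, dashes or digits."""
    host = urlsplit(url).hostname or ""
    if len(host) < min_length:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits


# Hypothetical examples: a keyword-stuffed host name and an ordinary one.
print(looks_machine_generated(
    "http://cheap-pills-online-best-price-discount-store-now.example.com/"))
print(looks_machine_generated(
    "http://www.example.com/a-long-path-with-many-dashes-1234567890"))
```

Raising or lowering the keyword arguments corresponds to the trade-off described above between the number of pages flagged and the number of false positives.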

Host Name Resolutions

One can notice that Google, given a query q, tends to rank a page higher if the host component of the page’s URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases, and set up DNS servers to resolve these host names to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.

This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In our set 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.

To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of these pages showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
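Counting host names per IP address is a plain grouping operation. The following sketch assumes the resolutions are already available as (host name, IP) pairs; the function name and the sample data are hypothetical, and the demo uses a tiny threshold instead of the 10,000 from the text.

```python
from collections import defaultdict


def ips_over_threshold(resolutions, threshold=10_000):
    """Return {ip: host name count} for IPs serving at least `threshold` host names."""
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host.lower())  # host names are case-insensitive
    return {ip: len(hosts) for ip, hosts in hosts_by_ip.items()
            if len(hosts) >= threshold}


# Toy data: three keyword host names resolve to one IP, one host to another.
resolutions = [
    ("cheap-loans.example", "192.0.2.10"),
    ("best-loans.example", "192.0.2.10"),
    ("fast-loans.example", "192.0.2.10"),
    ("www.example.org", "192.0.2.20"),
]
print(ips_over_threshold(resolutions, threshold=3))
```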

Linkage Properties

The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology the number of outgoing links of a page is its out-degree, while its in-degree equals the number of links pointing to the page. By analyzing out-degree and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.

In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall, 0.05% of pages in Set 2 have out-degrees at least three times more common than suggested by the Zipfian distribution, and according to the manual inspection of a cross section, almost all of them are spam.

The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of pages in Set 2 have in-degrees at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
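The comparison against the overall trend can be sketched as follows. The text does not say how the Zipfian trend was estimated, so this example assumes an ordinary least-squares line fitted to log(count) against log(degree); the function names and the synthetic data are hypothetical.

```python
import math


def fit_power_law(counts):
    """Least-squares fit of log(count) = intercept + slope * log(degree).

    counts maps degree (>= 1) to the number of pages with that degree (>= 1).
    """
    xs = [math.log(degree) for degree in counts]
    ys = [math.log(count) for count in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def overrepresented_degrees(counts, factor=3.0):
    """Return {degree: (observed, expected)} where observed >= factor * expected."""
    intercept, slope = fit_power_law(counts)
    flagged = {}
    for degree, observed in counts.items():
        expected = math.exp(intercept + slope * math.log(degree))
        if observed >= factor * expected:
            flagged[degree] = (observed, expected)
    return flagged


# Synthetic distribution following count ~ degree^-2, with a spike at degree 30.
counts = {d: round(1_000_000 / d ** 2) for d in range(1, 51)}
counts[30] *= 50
print(sorted(overrepresented_degrees(counts)))
```

A large enough spike also pulls the fitted line toward itself, so a more careful version would use a robust fit; the plain fit is kept here for brevity.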

Content Properties

Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words which makes them especially easy to detect using statistical methods.

For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count of pages downloaded from a given host name. The variance is plotted on the x-axis and the word count is shown on the y-axis; both axes are drawn on a logarithmic scale. Points on the left side of the graph marked in blue represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).
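A minimal sketch of the word-count check, assuming the crawl data is available as (host name, word count) pairs. It reads "at least 10 pages with the same word count" as a host with 10 or more pages whose word counts have zero variance, which is one interpretation of the text; the function name and the sample hosts are made up.

```python
from collections import defaultdict
from statistics import pvariance


def uniform_word_count_hosts(pages, min_pages=10):
    """Return hosts with at least `min_pages` pages that all share one word count."""
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(host for host, counts in counts_by_host.items()
                  if len(counts) >= min_pages and pvariance(counts) == 0)


# Toy data: a templated host with 12 identical pages and a normal host.
pages = [("template-farm.example", 350)] * 12
pages += [("news.example", 300 + 17 * i) for i in range(12)]
print(uniform_word_count_hosts(pages))  # ['template-farm.example']
```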

Content Evolution

The natural evolution of content on the Web is slow. Over a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam pages are generated in response to an HTTP request independently of the requested URL and will change completely on every download. Therefore, by looking at extreme cases of content mutation, search engines are able to detect web spam.

The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. The manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page counted as a false positive.

Clustering Properties

Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and have only minor differences (such as varying keywords inserted into the template). Pages with such properties can be detected by applying clustering analysis to our samples.

To form clusters of similar pages, the “shingling” algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of the cluster sizes of near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.

The outliers can be put into two groups. The first group did not contain any spam pages; pages in this group are related to the duplicated content issue instead. The second group, in contrast, is populated predominantly by spam documents. 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
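The core idea behind shingling can be shown in a few lines. This is a simplified sketch: it compares full sets of word shingles directly, whereas the algorithm of Broder et al. [2] works on compact fingerprints of the shingles so that it scales to millions of pages. The shingle size of 4 words and the sample texts are assumptions made for the example.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word sequences) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Two pages built from one template with a different keyword, plus an unrelated page.
template = "buy cheap {} online today with free shipping and the best prices guaranteed"
page_a = template.format("watches")
page_b = template.format("laptops")
page_c = "the committee will meet on thursday to review the annual budget"

print(resemblance(page_a, page_b))  # high: same template
print(resemblance(page_a, page_c))  # no shared shingles
```

Pages whose resemblance exceeds a chosen cutoff can then be grouped into the same cluster; the cutoff itself is a tuning parameter that the text does not specify.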

To Sum Up

The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources and not on technical tricks.

References:

1. Dennis Fetterly, Mark Manasse, Marc Najork. “Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages” (2004). Microsoft Research.

2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. “Syntactic Clustering of the Web”. In 6th International World Wide Web Conference, April 1997.



Natalie