0
votes

I have a Sitecore multisite setup.

I'm currently struggling with the "duplicate content syndrome": Googlebot indexes my sites and is able to reach the content of one site through the other site's host name.

This means it finds the same content on two different host names, which lowers the sites' ranking in Google search.

The reason it finds duplicate content is that I can access a child node of the opposite site from the one I'm currently browsing, simply by typing its name in the URL.

This is my web.config setup of the sites:

<site name="website2" hostName="local.domain.dk" virtualFolder="/"
      physicalFolder="/" rootPath="/sitecore/content/talk" startItem="/" database="web" domain="extranet"
      allowDebug="true" cacheHtml="true" htmlCacheSize="10MB" registryCacheSize="0" viewStateCacheSize="0"
      xslCacheSize="5MB" filteredItemsCacheSize="2MB" enablePreview="true" enableWebEdit="true"
      enableDebugger="true" disableClientData="false"/>

<site name="website" virtualFolder="/" physicalFolder="/"
      rootPath="/sitecore/content/home" startItem="/" database="web" domain="extranet"
      allowDebug="true" cacheHtml="true" htmlCacheSize="10MB" registryCacheSize="0" viewStateCacheSize="0"
      xslCacheSize="5MB" filteredItemsCacheSize="2MB" enablePreview="true" enableWebEdit="true"
      enableDebugger="true" disableClientData="false"/>

Even though I set the rootPath to the root of each site, I am still able to access the child node of local.domain.dk/ydelser/integration by typing local.domain-talk/integration.

Any help would be much appreciated!

5 Answers

1
votes

You need to make sure you have set both the hostName and targetHostName attributes in your <site> configuration. This ensures that when you link to content in another site, the link is rendered as a full URL including the host name.

hostName: The host name of the incoming url. May include wildcards (ex. www.site.net, *.site.net, *.net, pda.*, print.*.net)
          It's possible to set more than one mask by using '|' symbol as a separator (ex. pda.*|print.*.net)
targetHostName: The host name to use when generating URLs to items within this site from the context of another site.
          If the targetHostName attribute is absent, Sitecore uses the value of the hostName attribute instead.
          Used only when the value of the Rendering.SiteResolving setting is true.

Also make sure Rendering.SiteResolving is set to true:

  <!--  SITE RESOLVING
        While rendering item links, some items may belong to different site. Setting this to true
        make LinkManager try to resolve target site in order to use the right host name.
        Default value: true
  -->
  <setting name="Rendering.SiteResolving" value="true" />

You will always be able to access a page by its full path, so, as Jens says, add canonical link tags. Once you've resolved the cross-site linking and canonical link issues, Googlebot should only be following clean links.

0
votes

It seems the hostName attribute is missing from the configuration of your "website" node. If you have two websites, you need two <site> nodes, each with its own corresponding hostName.
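
As a rough sketch of what that could look like, here are two site nodes that each carry their own hostName (and a matching targetHostName for cross-site link generation). The host names talk.example.com and www.example.com are placeholders, not the asker's real domains, and the remaining attributes from the question (caching, debugging and so on) are left out for brevity:

```xml
<sites>
  <!-- Placeholder host names; substitute the real domain of each site. -->
  <site name="website2" hostName="talk.example.com" targetHostName="talk.example.com"
        virtualFolder="/" physicalFolder="/" rootPath="/sitecore/content/talk"
        startItem="/" database="web" domain="extranet" />
  <site name="website" hostName="www.example.com" targetHostName="www.example.com"
        virtualFolder="/" physicalFolder="/" rootPath="/sitecore/content/home"
        startItem="/" database="web" domain="extranet" />
</sites>
```

Sitecore picks the first site definition that matches the request, so a node without a hostName acts as a catch-all for any host that reaches it.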

Are you using a custom item resolver in the pipelines? That could cause this as well.

0
votes

Sitecore has a number of issues with multi-site link generation, some of which have been addressed in the latest release of 6.6: http://sdn.sitecore.net/Products/Sitecore%20V5/Sitecore%20CMS%206/ReleaseNotes/ChangeLog/Release%20History%20SC66.aspx#660update6 (look for the section on changes to the Link Provider).

It is also reasonably simple to add a few extra safeguards against cross-site noise such as this. You could add a step after the ItemResolver in the httpRequestBegin pipeline, along the lines of the following (sorry, I'm a bit pressed for time to write up a compilable example, but this should give the idea):

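// Site.StartItem is a path string, so load the actual start item via the site's full StartPath.
Item siteRoot = Sitecore.Context.Database.GetItem(Sitecore.Context.Site.StartPath);
Item current = Sitecore.Context.Item;
if (current != null && siteRoot != null &&
    !(current.ID == siteRoot.ID || current.Axes.IsDescendantOf(siteRoot)))
  // break and do 404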
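
A check like this still has to be registered in the pipeline. A minimal sketch of an include file that would do so, assuming the check is wrapped in a processor class called MySite.Pipelines.SiteBoundaryCheck in an assembly called MySite (both names are made up for illustration):

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <httpRequestBegin>
        <!-- Hypothetical type and assembly names; placed directly after the stock ItemResolver. -->
        <processor patch:after="processor[@type='Sitecore.Pipelines.HttpRequest.ItemResolver, Sitecore.Kernel']"
                   type="MySite.Pipelines.SiteBoundaryCheck, MySite" />
      </httpRequestBegin>
    </pipelines>
  </sitecore>
</configuration>
```

If the pipelines are maintained directly in web.config instead of include files, the equivalent is to add the <processor> element on the line below the ItemResolver entry.
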
0
votes

The way that Sitecore resolves items makes it possible to access pages in multiple sites and with multiple domains.

If you have the following structure:

-sitecore
--content
---site1
----site1page1
---site2
----site2page1

And site1 has the domain site1.com and site2 has site2.com, you can always address an item with its full path. So for instance you can access site2's pages on site1 like this:

site1.com/sitecore/content/site2/site2page1.aspx

There are multiple ways of handling this with regard to SEO, but the simplest is to use canonical links in the page metadata, so that Google doesn't treat these URLs as duplicate content. You can then add logic that renders a <link rel="canonical"> element with the preferred URL on every page.

If you don't want to allow pages from one site to be shown on another site, you should create different domains for each site, and then use Sitecore security to disallow read access from one site to another. For instance you could create site1 as a domain and then restrict read access on site2 items in that domain.

0
votes

I agree that the standard ItemResolver is too forgiving with the URLs. Not only can you get the same page in any site, but you can also get duplicates by using the full Sitecore path (e.g. /sitecore/content/Site/page). On one project, where this was a big issue for the client, I created a custom ItemResolver that would be more strict. Here it is:

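using Sitecore;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;
using Sitecore.Pipelines.HttpRequest;

public class ItemResolver : Sitecore.Pipelines.HttpRequest.ItemResolver
{
    public override void Process(HttpRequestArgs args)
    {
        Assert.ArgumentNotNull(args, "args");
        // Only resolve when no item is set yet and the URL carries an item path.
        if (((Context.Item == null) && (Context.Database != null)) && (args.Url.ItemPath.Length != 0))
        {
            // Requests in the "sitecore" domain (the Sitecore client) keep the standard behavior.
            if (Context.Domain.Name.ToLower() == "sitecore")
            {
                base.Process(args);
                return;
            }

            Profiler.StartOperation("Resolve current item.");
            // Single lookup by the item path from the request; no fallback lookups.
            string path = MainUtil.DecodeName(args.Url.ItemPath);
            Item item = args.GetItem(path);
            if (item != null)
            {
                Tracer.Info("Current item is \"" + path + "\".");
            }
            // May be null, in which case the request is treated as "item not found".
            Context.Item = item;
            Profiler.EndOperation();
        }
    }
}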

If you compare this to the decompiled standard ItemResolver, you will see that it is the same code used for the first step. It just doesn't attempt to find the item using other means if that first step fails. Another nice benefit of this is that it runs a bit faster than the standard ItemResolver.
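
To put the stricter resolver to use, it has to take the place of the standard one in the httpRequestBegin pipeline. One way to do that is an include file that rewrites the type attribute of the existing processor. The sketch below assumes the class above has been placed in a namespace MySite.Pipelines and compiled into an assembly MySite; both names are hypothetical:

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <httpRequestBegin>
        <!-- Matches the stock processor by its type and swaps in the custom class (hypothetical names). -->
        <processor type="Sitecore.Pipelines.HttpRequest.ItemResolver, Sitecore.Kernel">
          <patch:attribute name="type">MySite.Pipelines.ItemResolver, MySite</patch:attribute>
        </processor>
      </httpRequestBegin>
    </pipelines>
  </sitecore>
</configuration>
```

Swapping the type in place keeps the processor at its original position in the pipeline, so the processors before and after it are unaffected.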