Friday, February 27, 2009

PDF file search in SharePoint!

The hidden truth is that SharePoint doesn't have a search mechanism of its own. It's just a postman between the client and the database server, where all the SharePoint content is stored. The SharePoint server takes the request string and builds a query to pass to SQL Server. SQL Server then hands the string to its own full-text search engine. File indexing is maintained on the Index server.

By default, SharePoint only searches .txt, .htm, .doc, .xls, and .ppt files, because those are the base file extensions SQL Server can crawl. PDF is a binary file type that can't be searched with the SQL full-text search engine, as the engine can't understand the PDF format. So Adobe came out with its own free filters (iFilters).
Adobe 5.0 iFilter Download
Adobe 6.0 iFilter Download
You need to follow the installation procedure on the download page and install the iFilter on the Index server.

PDF Document Full Text Search:

If you do not install the iFilter on the Index server, PDF files show up in the search results only when the query matches the title. The text inside the PDF file is not indexed and does not appear in the search results. In order to index the content inside a PDF document (full text search), you need to install the iFilter on the Index server.
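To make the title-only behaviour concrete, here is a toy sketch in Python. It is not how SharePoint or SQL Server is implemented; `index_terms`, `decode_plain_text`, and the filter table are hypothetical names made up for illustration, and the "PDF" bytes are plain text standing in for a real PDF body. The only point is that an indexer can extract text only from extensions that have a registered filter.

```python
import os
from typing import Callable, Dict, Set

# Hypothetical stand-in for an iFilter: takes raw bytes, returns plain text.
TextFilter = Callable[[bytes], str]


def decode_plain_text(data: bytes) -> str:
    """Pretend filter: real iFilters parse the file format, this just decodes."""
    return data.decode("utf-8", errors="ignore")


def index_terms(filename: str, data: bytes, filters: Dict[str, TextFilter]) -> Set[str]:
    """Return the terms a toy indexer would record for one file."""
    stem, ext = os.path.splitext(filename)
    # The file name (title) is always indexed.
    terms = set(stem.lower().replace("_", " ").split())
    # The body is indexed only if a filter is registered for the extension.
    text_filter = filters.get(ext.lstrip(".").lower())
    if text_filter is not None:
        terms |= set(text_filter(data).lower().split())
    return terms


pdf_body = b"quarterly revenue figures"  # plain text posing as PDF content
without_ifilter = {"txt": decode_plain_text}
with_ifilter = {"txt": decode_plain_text, "pdf": decode_plain_text}

print(sorted(index_terms("annual_report.pdf", pdf_body, without_ifilter)))
print(sorted(index_terms("annual_report.pdf", pdf_body, with_ifilter)))
```

With no "pdf" entry in the filter table, a search for "revenue" would miss the file even though the word is in its body, which is the symptom described above.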

Files uploaded after the iFilter is installed are full-text indexed automatically at the next full crawl. But sometimes files that were uploaded before the iFilter was installed are not added to the full-text index automatically. We need to force full-text indexing of these old files by disabling full-text indexing and enabling it again in Central Administration. If that has no effect through the Central Administration UI, we can use Query Analyzer to force a rebuild, like this:

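-- Replace the database name with the name of your WSS content database
USE [WSS Content DB Name]
-- 'STS_servername' is the full-text catalog name; substitute your own
EXEC sp_fulltext_catalog 'STS_servername', 'rebuild'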
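If you have several content databases, typing the statements by hand gets tedious. The sketch below only builds the T-SQL text shown above for a given database and catalog name; it does not connect to SQL Server. The helper names and the example names `WSS_Content` and `STS_server1` are made up for illustration, so substitute your own.

```python
def quote_identifier(name: str) -> str:
    """Bracket-quote a T-SQL identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value: str) -> str:
    """Single-quote a T-SQL string literal, doubling any embedded quote."""
    return "'" + value.replace("'", "''") + "'"


def rebuild_catalog_sql(database: str, catalog: str) -> str:
    """Return the two statements that force a full-text catalog rebuild."""
    return "\n".join([
        f"USE {quote_identifier(database)}",
        f"EXEC sp_fulltext_catalog {quote_literal(catalog)}, 'rebuild'",
    ])


# Hypothetical names; paste the printed text into Query Analyzer.
print(rebuild_catalog_sql("WSS_Content", "STS_server1"))
```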

PDF Icon display:

By default, PDF files are shown with an Internet Explorer icon in the search results. Follow the procedure below to associate a PDF icon with the search results. (Note: Even though the iFilter is installed only on the Index server, this PDF icon procedure has to be followed on each and every front-end server in the SharePoint server farm, i.e. all web servers and the Index server.)

  1. Go to the Adobe site and copy the PDF icon (PDFicon.gif) into the folder below.
    Drive:\Program Files\Common Files\Microsoft Shared\web server extensions\12\TEMPLATE\IMAGES
  2. Edit the document mapping XML file (Docicon.xml):
    Drive:\Program Files\Common Files\Microsoft Shared\Web server extensions\12\Template\Xml\Docicon.xml
  3. Add the line below under the ByExtension node. The Value must match the file name of the icon copied in step 1.
    <Mapping Key="pdf" Value="PDFicon.gif"/>
  4. Recycle IIS and the search service. Go to Run -> cmd and enter:
    iisreset
    net stop osearch
    net start osearch
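Since the Docicon.xml edit has to be repeated on every front-end server, it can help to script it. Below is a minimal sketch in Python that adds (or updates) a mapping under the ByExtension node. It works on an in-memory sample; the sample XML is a trimmed-down assumption of what Docicon.xml looks like, and `add_icon_mapping` is a made-up helper name. Back up the real file before trying anything like this on a server.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for Docicon.xml (assumed layout, not the real file).
SAMPLE_DOCICON = """<DocIcons>
  <ByProgID/>
  <ByExtension>
    <Mapping Key="doc" Value="icdoc.gif"/>
    <Mapping Key="txt" Value="ictxt.gif"/>
  </ByExtension>
  <Default>
    <Mapping Value="icgen.gif"/>
  </Default>
</DocIcons>"""


def add_icon_mapping(xml_text: str, extension: str, icon_file: str) -> str:
    """Return xml_text with a Mapping for `extension` under ByExtension."""
    root = ET.fromstring(xml_text)
    by_extension = root.find("ByExtension")
    if by_extension is None:
        raise ValueError("no ByExtension node found")
    for mapping in by_extension.findall("Mapping"):
        if mapping.get("Key", "").lower() == extension.lower():
            # Extension already mapped: just point it at the new icon.
            mapping.set("Value", icon_file)
            break
    else:
        ET.SubElement(by_extension, "Mapping", {"Key": extension, "Value": icon_file})
    return ET.tostring(root, encoding="unicode")


print(add_icon_mapping(SAMPLE_DOCICON, "pdf", "PDFicon.gif"))
```

Running the function twice leaves a single pdf entry, so the script is safe to re-run on a server that has already been updated.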

For 64-bit Windows, you need to download the iFilter separately. The iFilter that comes with the Adobe v8 installation supports only 32-bit operating systems.

SQL Server's search engine deals with the base types, and with the addition of new iFilters it can handle the associated formats as well. Microsoft provides a free iFilter for RTF. Other iFilters for PDF, RTF, MSG, and ZIP can be found at IFilterShop.

Additional procedure for Adobe v8 iFilter:

Adobe v8 comes with the iFilter, so there is no need to install a separate iFilter for Adobe v8 on 32-bit Windows. The steps below apply to 64-bit Windows:

  1. Add the filter-extension to the File types crawled:
    Start -> Program -> Microsoft Office Server -> SharePoint 3.0 Central Administration -> Search Settings -> File Types -> New File Type (Add extension pdf here)
  2. Modify the following registry keys by changing their "Default" value to the new CLSID of the Adobe IFilter, {E8978DA6-047F-4E3D-9C78-CDBE46041603}:
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf
    Default -> {E8978DA6-047F-4E3D-9C78-CDBE46041603}
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf
    Default -> {E8978DA6-047F-4E3D-9C78-CDBE46041603}
  3. Add the installation directory of Adobe Reader v8 to the system path. For example, if the Reader is installed under "C:\Program Files\Adobe", then add "C:\Program Files\Adobe\Reader 8.0\Reader" or "C:\Program Files\Adobe\Reader 9.0\Reader" to the system path: right-click My Computer -> Properties -> Advanced -> Environment Variables -> Path (under System Variables) -> Edit -> (add "C:\Program Files\Adobe\Reader 8.0\Reader"). This effectively tells the Adobe IFilter where to pick up its dependent DLLs.
  4. Copy the .gif file that you want to use for the icon to the following folder on the server (SharePoint Server 2007):
    Drive:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\12\Template\Images
  5. Edit the DOCICON.xml file to include the .pdf extension. On SharePoint Server 2007, navigate to Drive:\Program Files\Common Files\Microsoft Shared\Web server extensions\12\Template\Xml, open the Docicon.xml file, add an entry for the .pdf extension, and save the file.
    Recycle IIS and the search service. Go to Run -> cmd and enter:
    iisreset
    net stop osearch
    net start osearch
    Now you can crawl and search PDF documents with v.8 Reader.
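The two registry changes in step 2 are easy to mistype by hand. One option is to generate a .reg file and import it with regedit. The Python sketch below only builds the file text from the keys and CLSID listed above; `build_reg_file` is a made-up helper name, and saving the text in an encoding that regedit accepts is left to you. Export the existing keys first so you can roll back.

```python
# CLSID and key paths are copied from step 2 of the procedure above.
ADOBE_IFILTER_CLSID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"
PDF_FILTER_KEYS = [
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"
    r"\ContentIndexCommon\Filters\Extension\.pdf",
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0"
    r"\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf",
]


def build_reg_file(keys, clsid):
    """Return .reg file text that sets the default value of each key to clsid."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key in keys:
        lines.append(f"[{key}]")
        lines.append(f'@="{clsid}"')  # "@" is the key's default value
        lines.append("")
    return "\n".join(lines)


print(build_reg_file(PDF_FILTER_KEYS, ADOBE_IFILTER_CLSID))
```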

