
Data extraction from the Web using XML.

dc.contributor.advisor: Karmouch, Ahmed
dc.contributor.author: Ouahid, Hicham.
dc.date.accessioned: 2009-03-23T18:26:36Z
dc.date.available: 2009-03-23T18:26:36Z
dc.date.created: 2001
dc.date.issued: 2001
dc.degree.level: Masters
dc.degree.name: M.A.Sc.
dc.description.abstract: This thesis presents a mechanism based on eXtensible Markup Language (XML) to extract data from HTML-based Web pages and populate relational databases. This task is performed by a system called the XML-based Web Agent (XWA). The data extraction is done in three phases. First, the Web pages are converted to well-formed XML documents to facilitate their processing. Second, the data is extracted from the well-formed XML documents and formatted into valid XML documents. Finally, the valid XML documents are mapped into tables to be stored in a relational database. To extract specific data from the Web, the XWA requires information about the Web pages from which to extract the data, the location of the data within the Web pages, and how the extracted data should be formatted. This information is stored in Web Site Ontologies, which are built using a language called the Web Ontology Description Language (WONDEL). WONDEL is based on XML and XML Pointer Language. It has been defined as part of this work to allow users to specify the data they want and let the XWA work offline to extract it and store it in a database. This saves users the time spent waiting for the Web pages to download and lets them benefit from the powerful query mechanisms offered by database management systems.
dc.format.extent: 132 p.
dc.identifier.citation: Source: Masters Abstracts International, Volume: 40-05, page: 1260.
dc.identifier.isbn: 9780612660960
dc.identifier.uri: http://hdl.handle.net/10393/9260
dc.identifier.uri: http://dx.doi.org/10.20381/ruor-7721
dc.publisher: University of Ottawa (Canada)
dc.subject.classification: Information Science.
dc.title: Data extraction from the Web using XML.
dc.type: Thesis
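
The three phases described in the abstract can be illustrated with a minimal Python sketch. This is not the XWA implementation and does not use WONDEL syntax, neither of which is given in this record. The sample page, the `ONTOLOGY` dictionary, the `items` table, and all function names are hypothetical. ElementTree's limited path syntax stands in for XML Pointer Language, and a required-field check stands in for validation against a schema.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class WellFormedBuilder(HTMLParser):
    """Phase 1: build a well-formed element tree from loosely written HTML."""
    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)          # void elements never stay open
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                          # ignore stray closing tags
        while self.open_tags:               # close unclosed children first
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def finish(self):
        while self.open_tags:               # close whatever the page left open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()

# Hypothetical stand-in for a Web Site Ontology: where the records are,
# where each field sits inside a record, and what to call it.
ONTOLOGY = {
    "record": ".//li[@class='item']",
    "fields": {"title": "b", "price": "span[@class='price']"},
}

def extract(root, ontology):
    """Phase 2: copy the located data into a new, regular XML document."""
    out = ET.Element("records")
    for node in root.findall(ontology["record"]):
        rec = ET.SubElement(out, "record")
        for name, path in ontology["fields"].items():
            hit = node.find(path)
            if hit is None or not (hit.text or "").strip():
                raise ValueError(f"missing field {name!r}")
            ET.SubElement(rec, name).text = hit.text.strip()
    return out

def store(records, conn):
    """Phase 3: map each <record> element to a row of a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, price REAL)")
    rows = [(r.findtext("title"), float(r.findtext("price"))) for r in records]
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()

def run(html, conn):
    parser = WellFormedBuilder()
    parser.feed(html)
    parser.close()
    store(extract(parser.finish(), ONTOLOGY), conn)

# Invented sample page: unquoted attributes, unclosed <br>, missing </html>.
SAMPLE_HTML = """<html><body>
<ul id=catalog>
<li class=item><b>XML Basics</b> <span class=price>20</span><br></li>
<li class=item><b>Web Agents</b> <span class=price>35</span><br></li>
</ul>
</body>
"""
```

Calling `run(SAMPLE_HTML, sqlite3.connect(":memory:"))` loads the two sample records into the `items` table, after which they can be queried with ordinary SQL.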

Files

Original bundle

Name: MQ66096.PDF
Size: 4.44 MB
Format: Adobe Portable Document Format