August 17, 2023

Web Mining

Introduction:

Web mining is the process of applying data mining techniques and algorithms to automatically discover and extract information from Web documents and services.

Web mining is a subset of data mining.

The data mined from the Web is typically the collection of facts that Web pages are meant to convey. It may consist of text, structured data such as lists and tables, and even images, video, and audio.

The web has several aspects that lead to different approaches to mining: web pages contain text and other content, pages are connected via hyperlinks, and user activity can be monitored via web server logs.

Types of Web Mining

Web Content Mining
Web Structure Mining
Web Usage Mining

Web content mining

Web content mining is the process of collecting useful data from the content of websites.
In essence, it is the retrieval and mining of the text, images, and graphs on web pages, typically based on a search query.
Content includes news, comments, company information, product catalogues, etc.
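
As a minimal sketch of content mining, the snippet below pulls the visible text out of a page using Python's standard html.parser module. The sample page and the names TextExtractor and extract_text are made up for illustration; a real pipeline would also need to fetch pages and handle malformed markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of non-empty text fragments found in the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page content, made up for this example.
SAMPLE_PAGE = """
<html><head><title>Acme Catalogue</title><style>p {color: red}</style></head>
<body><h1>Products</h1><p>Widget, 10 USD</p><script>var x = 1;</script></body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_PAGE))
    # ['Acme Catalogue', 'Products', 'Widget, 10 USD']
```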

Web structure mining

Web structure mining is the process of discovering structure information from the web.
The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages.
These connections let us extract data associated with a search query and point us directly to the linked pages of a website.
It also covers the organization of content within a page, which forms a tree structure based on the page's HTML and XML tags.
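
One way to picture the web graph is as a dictionary that maps each page to the pages it links to. The sketch below builds such a graph from a small, made-up set of pages using the standard html.parser module; the names LinkExtractor, build_web_graph, and PAGES are assumptions for illustration only.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def build_web_graph(pages):
    """Map each page URL (node) to the known pages it links to (edges)."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        # Keep only links to pages in the collection; drop duplicates and self-links.
        targets = []
        for link in parser.links:
            if link in pages and link != url and link not in targets:
                targets.append(link)
        graph[url] = targets
    return graph


# Hypothetical three-page site, made up for this example.
PAGES = {
    "/home": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/home">Home</a> <a href="/about">About</a> '
             '<a href="https://example.com">External</a>',
}

if __name__ == "__main__":
    for page, targets in build_web_graph(PAGES).items():
        print(page, "->", targets)
```

The external link is dropped here only because the sketch restricts the graph to pages it has actually seen.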

Web usage mining

Web usage mining is the process of applying data mining techniques to derive useful information from web logs.
It serves user-oriented needs such as identifying the web pages that a user has accessed.
It is used to observe user behavior while users interact with the web.
A web server records a log entry for each access to a web page. The entry contains the requested URL, the IP address of the requester, and a timestamp.
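
To make this concrete, here is a small sketch that reads log entries and extracts the URL, IP address, and timestamp from each. It assumes the logs follow the Common Log Format; actual servers may be configured differently. The sample lines and function names are invented for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumed layout: Common Log Format. Adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return a dict with ip, url and timestamp, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return {"ip": match["ip"], "url": match["url"], "timestamp": timestamp}


def page_hits(lines):
    """Count how many times each URL was requested."""
    entries = (parse_line(line) for line in lines)
    return Counter(entry["url"] for entry in entries if entry)


def pages_by_visitor(lines):
    """List the URLs requested by each IP address, in log order."""
    visits = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry:
            visits[entry["ip"]].append(entry["url"])
    return dict(visits)


# Hypothetical log lines, made up for this example.
SAMPLE_LOG = [
    '203.0.113.5 - - [17/Aug/2023:10:15:32 +0000] "GET /home HTTP/1.1" 200 1043',
    '203.0.113.5 - - [17/Aug/2023:10:16:01 +0000] "GET /blog HTTP/1.1" 200 2210',
    '198.51.100.7 - - [17/Aug/2023:10:17:45 +0000] "GET /home HTTP/1.1" 200 1043',
]

if __name__ == "__main__":
    print(page_hits(SAMPLE_LOG))
    print(pages_by_visitor(SAMPLE_LOG))
```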

Why web mining?

We are living in the era of the internet. We look for content using search engines such as Google and Yahoo, which return a list of websites based on the search query.

So it helps to understand how a search engine works.

In the early 90s, the first search engines used text-based ranking systems to decide which pages to return based on a given query.

The search engine goes through its index and counts the occurrences of the query keywords in each web document. The pages with the highest number of keyword occurrences win and are displayed to the user.
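
A rough sketch of such a text-based ranking is shown below: it counts how often the query keywords occur in each page and sorts the pages by that count. The tiny index and the function names are made up for illustration; early search engines were of course more involved than this.

```python
import re


def keyword_score(text, keywords):
    """Total number of times any of the keywords occurs in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sum(words.count(keyword.lower()) for keyword in keywords)


def rank_pages(index, query):
    """Return URLs containing at least one keyword, highest count first."""
    keywords = query.split()
    scores = {url: keyword_score(text, keywords) for url, text in index.items()}
    matching = [url for url in scores if scores[url] > 0]
    return sorted(matching, key=lambda url: scores[url], reverse=True)


# Hypothetical index mapping URLs to page text, made up for this example.
INDEX = {
    "/a": "Web mining applies data mining to the web",
    "/b": "Mining gold and silver",
    "/c": "Cooking recipes",
}

if __name__ == "__main__":
    print(rank_pages(INDEX, "web mining"))  # ['/a', '/b']
```

Note that "/b" is returned even though it has nothing to do with the web, which hints at why counting words alone is not enough.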

But the text-based ranking system was not ideal, because millions of web pages may contain a particular word, and no user scans all of them.

Users typically look at only the top 5–20 web pages returned for their query.

Modern search engines return more relevant results than purely text-based ranking systems. They rely on link analysis, most notably PageRank, one of the most influential algorithms for computing the relevance of web pages, used by the Google search engine.

Search Engine Types

Title-based Search Engine

Searches only the “titles” of documents
Sorts the results

Full-Text Search Engine

E.g. Google
Examines all the words in every stored document and also uses PageRank

PageRank

PageRank is the algorithm Google uses to rank pages. It counts the links pointing to a page and assigns the page a numeric value.

That numeric value indicates how important the web page is.

How is PageRank calculated?

To calculate PageRank, all of the links from the web pages are taken into account.

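There are three types of links:

Inbound link: a link that points to the page from another page
Outbound link: a link from the page to another page
Dangling link: a link that points to a page with no outbound links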
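
The three link types can be read directly off a web graph. The sketch below assumes the graph is stored as a dictionary mapping each page to the pages it links to, and takes a dangling link to mean a link to a page with no outbound links of its own. The graph and function names are hypothetical.

```python
def outbound_links(graph, page):
    """Pages that the given page links to."""
    return list(graph.get(page, []))


def inbound_links(graph, page):
    """Pages that link to the given page."""
    return [source for source, targets in graph.items() if page in targets]


def dangling_links(graph):
    """(source, target) pairs whose target has no outbound links."""
    return [
        (source, target)
        for source, targets in graph.items()
        for target in targets
        if not graph.get(target)
    ]


# Hypothetical graph: D is linked to but links nowhere.
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}

if __name__ == "__main__":
    print(outbound_links(GRAPH, "A"))  # ['B', 'C']
    print(inbound_links(GRAPH, "C"))   # ['A', 'B']
    print(dangling_links(GRAPH))       # [('C', 'D')]
```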

PageRank formula

PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn))

This equation gives the PageRank of web page A, a measure of how important the page is.

Here,

t1, t2, …, tn are the web pages that link to web page A.

C(t) is the number of outbound links that page t has.

d is the damping factor (typically d = 0.85).
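
Because each page's rank depends on the ranks of the pages linking to it, the formula is usually applied repeatedly until the values settle. The sketch below does this for a small hypothetical graph, using PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)) as the update rule. The starting value of 1.0, the fixed number of iterations, and the function name are assumptions made for the example, and the graph is assumed to have no pages without outbound links.

```python
def simple_pagerank(graph, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(t)/C(t)) over all pages.

    graph maps each page to the list of pages it links to.
    Every page is assumed to have at least one outbound link.
    """
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Sum PR(t)/C(t) over every page t that links to this page.
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in graph.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph, made up for this example.
GRAPH = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

if __name__ == "__main__":
    for page, rank in sorted(simple_pagerank(GRAPH).items()):
        print(page, round(rank, 3))
```

In this graph C collects links from both A and B, so it ends up with the highest value, while B, with a single inbound link shared with C, ends up lowest.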

Now we’ll look into some examples of how PageRank works.

Example 1:

Let us assume four web pages A, B, C, and D.

Let each page have an initial PageRank of 0.25.

If B, C, and D each link only to A, the PageRank of web page A is PR(A) = PR(B) + PR(C) + PR(D) = 0.75.

Example 2:

The PageRank of web page A is

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)

L(B), L(C), and L(D) are the numbers of outbound links of pages B, C, and D. With L(B) = 2, L(C) = 1, and L(D) = 3:

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
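
The arithmetic can be checked in a few lines, assuming each page still starts with a PageRank of 0.25 as in Example 1. The variable names are only illustrative.

```python
# Starting ranks (assumed to be 0.25 each, as in Example 1).
ranks = {"B": 0.25, "C": 0.25, "D": 0.25}

# Number of outbound links per page: L(B) = 2, L(C) = 1, L(D) = 3.
outbound = {"B": 2, "C": 1, "D": 3}

# PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)
pr_a = sum(ranks[page] / outbound[page] for page in ranks)

if __name__ == "__main__":
    print(round(pr_a, 3))  # 0.458
```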

The parameter d is the damping factor, which can be set between 0 and 1 (d is usually set to 0.85).

With damping, the PageRank of web page A is:

PR(A) = (1-d)/N + d(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D))

where N is the total number of pages (four in this example).
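
As a quick check of the damped version, the sketch below evaluates PR(A) = (1-d)/N + d(PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)) once. It assumes starting ranks of 0.25, N = 4 pages, and the outbound link counts 2, 1, and 3 from the example; the function name is made up.

```python
def damped_pagerank(inbound, n_pages, d=0.85):
    """One evaluation of PR = (1-d)/N + d * sum(PR(t)/L(t)).

    inbound is a list of (rank, outbound_link_count) pairs,
    one for each page that links to the page being scored.
    """
    link_sum = sum(rank / out_count for rank, out_count in inbound)
    return (1 - d) / n_pages + d * link_sum


# Pages B, C and D link to A: (PR, L) pairs with assumed starting ranks of 0.25.
inbound_to_a = [(0.25, 2), (0.25, 1), (0.25, 3)]
pr_a = damped_pagerank(inbound_to_a, n_pages=4)

if __name__ == "__main__":
    print(round(pr_a, 3))  # 0.427
```

With d = 0.85 the result is a little lower than the undamped 0.458, because only 85% of the linked rank is passed on and the rest is spread evenly over all N pages.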

Implementation of page rank algorithm

networkx is a Python package for creating graph structures and computing PageRank, as well as the total number of nodes and edges in a graph.
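
A minimal sketch with networkx might look like the following. It assumes networkx is installed (recent versions also need SciPy for pagerank). The edges are an assumed layout that matches the outbound link counts from Example 2: B links to A and C, C links to A, and D links to A, B, and C.

```python
import networkx as nx

# Directed graph: an edge (X, Y) means page X links to page Y.
# This edge list is an assumption made for illustration.
G = nx.DiGraph()
G.add_edges_from([
    ("B", "A"), ("B", "C"),
    ("C", "A"),
    ("D", "A"), ("D", "B"), ("D", "C"),
])

total_nodes = G.number_of_nodes()
total_edges = G.number_of_edges()

# alpha is networkx's name for the damping factor.
ranks = nx.pagerank(G, alpha=0.85)

if __name__ == "__main__":
    print("nodes:", total_nodes, "edges:", total_edges)
    for page, rank in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
        print(page, round(rank, 3))
```

networkx normalizes the ranks so that they sum to 1, and it has its own handling for pages without outbound links (page A here), so the numbers will not match a hand calculation of a single step.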

Conclusion

PageRank is not the only algorithm used by search engines; Google also uses other algorithms and methods to rank pages these days. But it was instrumental in launching Google to the forefront, and it demonstrates the power of web mining.

CodeStax.Ai
