Data Science: Data Science Statistics


Data Science Applications - Page Rank Text Summarization

What Is Text Summarization?

Text summarization is one of the most significant applications of natural language processing.

It assists with reducing the quantity of original text and extracting just the relevant information.

Text summarization is a form of data reduction.

It entails generating a summary of the original text that allows the user to get the key information from that text in a much shorter amount of time.

Types of Text Summarization Processes

Text summarization can be categorized in several ways, and each category can be further subdivided. The classification of the text summarization process is shown in the figure.

Based on the Number of Documents: Text summarization may be divided into categories depending on the number of source documents:

• Single:

A single-document summary condenses one source document. Because the summary is short, clear, and concise, it is easier to use than the full text.
Several subdocuments may be combined to form a single document.
These subdocuments may emphasize different viewpoints even though they all cover the same topic.

• Multiple:

A multi-document summary is a technique for managing a large amount of data spread across multiple related source documents by including just the most important information or main concepts in a small amount of space.
Multi-document summarization has recently become a hot topic in automated summarization.

A. Based on Summary Usage: Text summarization may be further subdivided into the following categories depending on how the summary is used:

• Generic Summaries: Generic summaries do not target any specific group since they are written for a large audience.
• Query-based: Query-based or subject-focused summaries are tailored to an individual's or a group's unique requirements and address a single issue.

The goal of query-based text summarization is to extract fundamental information from the original text that answers the question.
The answer is presented in a small, predetermined number of words.

B. Based on Technique: Text summarization may be further divided into subcategories based on the following techniques:

• Supervised:

Supervised text summarization is similar to supervised keyphrase extraction: both learn from labeled examples.

Essentially, if you have a collection of documents and human-generated summaries for them, you can learn the characteristics of sentences that make them a good fit for inclusion in the summary.

• Unsupervised:

Unsupervised keyphrase extraction eliminates the need for training data.

It approaches the problem from a different perspective.

Rather than trying to learn explicit characteristics that mark important words, the TextRank algorithm makes use of the text's own structure to choose key phrases that seem "central" to the text, similar to how PageRank selects important web pages.
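To make the idea concrete, here is a minimal sketch of a TextRank-style sentence ranker. It is only an illustration under stated assumptions: the sentence splitter is naive, the word-overlap similarity is one common choice rather than the only one, and the function names (split_sentences, similarity, textrank, summarize) are hypothetical.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption made for this sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity(a, b):
    """Word overlap between two sentences, normalized by their log lengths."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional to the edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(text, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many other sentences end up with higher scores, while a sentence that shares nothing with the rest keeps only the baseline score.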

C. Based on the Textual Characteristics of the Summary: Text summarization may be classified into a variety of groups depending on the features of the summary text, such as:

• Abstractive Summarization:

Abstractive summarization methods change the material by adding new phrases, rephrasing, or inserting terms not found in the original text.

For a flawless abstractive summary, the model must first understand the text before expressing it with new words and phrases.

This involves complex elements such as generalization, paraphrasing, and integrating real-world knowledge.

• Extractive Summarization:

Extractive summarization creates summaries by combining selected sentences or phrases taken from the source material.

In such situations, ranking the importance of the different sentences is a crucial step.

The most essential sentences are extracted and then reassembled to provide a summary.

The PageRank Algorithm

Around 1998, Page and Brin collaborated to create and improve the PageRank algorithm. It was primarily used in the prototype of Google's search engine.

The purpose of this algorithm is to determine the popularity, or importance, of a web page based on how pages on the web link to one another.

According to the theory, a web page with more incoming hyperlinks is more important than a web page with fewer incoming hyperlinks.

A web page that is linked from a highly important page is also considered important.

PageRank is one of the most widely used ranking algorithms, and it was created as a method for analyzing web links.

The PageRank algorithm is used to calculate the weight of web pages, and it is the same concept that Google uses to rank web pages in search results.
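The ranking idea described above can be sketched with a short power-iteration loop. This is a simplified illustration, not Google's production method: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, and the tiny four-page web are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute PageRank for a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    if n == 0:
        return {}
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical web: A and B both link to C, and C links to D.
web = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C outranks A and B because it has more incoming links, and D outranks A and B because its single incoming link comes from an important page.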

Data Science and Statistics

The statistical ideas utilized in data science are explained in this article. It's crucial to note that data science isn't a new idea, and that most statisticians are capable of working as data scientists.

Many principles from statistics are used in data science since statistics is one of the most effective tools for processing and interpreting data. Statistical techniques may help you extract a lot of information from a data set.

You should study everything you can about statistics if you want to understand data science and become an expert in the area.

While there are numerous areas in statistics that a data scientist should be familiar with, the following are the most crucial:

• Descriptive Statistics: methods used to describe and summarize a data set.
• Inferential Statistics: methods used to draw conclusions about a population from a sample.

Descriptive statistics is the act of describing or looking at data in a way that makes it easier to understand.

This technique aids in the quantitative summarizing of data using numerical representations or graphs.

The following are some of the subjects you should study:

Normal Distribution
Central Tendency
Variability
Skewness and Kurtosis

Normal Distribution

A normal distribution, also known as a Gaussian distribution, is a continuous distribution often used in statistics. When plotted, data that follows a normal distribution forms a symmetric, bell-shaped curve.

In normal distributions, the peak of the bell-shaped curve is at the center, which corresponds to the mean of the data set.

Values farther from the mean are less frequent and fall in the tails of the curve. Many inferential methods assume that the data is approximately normally distributed, so check this assumption before making inferences from the data set.
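As a quick illustration of how the values concentrate around the mean, the sketch below uses Python's standard statistics.NormalDist to compute the share of a normal distribution that lies within one, two, and three standard deviations of the mean. The helper name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(dist, num_sigmas):
    """Fraction of the distribution within num_sigmas standard deviations of the mean."""
    lower = dist.mean - num_sigmas * dist.stdev
    upper = dist.mean + num_sigmas * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    print(k, round(share_within(standard_normal, k), 4))
```

Roughly 68% of values fall within one standard deviation, about 95% within two, and about 99.7% within three, which is why values in the tails are rare.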

Central Tendency

Measures of central tendency aid in determining the data set's center values. The mean, median, and mode are the three most widely used measurements. The mean, or arithmetic mean, is the average of all values in the data set.

The following formula may be used to compute the data set's mean or average:

(sum of all data points in the data set) / (number of data points)

Another metric is the median, which is the midpoint of the data set when the points are sorted in ascending order.

You can easily find the midpoint if you have an odd number of values, but if you have an even number of data points, you take the average of the two data points in the middle of the data set.

The mode is the last metric, and its value is the data point that appears the most times in the data collection.
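Here is a small sketch of the three measures using Python's standard statistics module. The data values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical data set

# Mean: sum of all data points divided by the number of data points.
manual_mean = sum(data) / len(data)
mean = statistics.mean(data)

# Median: with an even number of points, the average of the two middle values.
median = statistics.median(data)

# Mode: the value that appears most often.
mode = statistics.mode(data)

print(mean, median, mode)
```

With six values, the median is the average of the two middle values, 3 and 5.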

Variability

Variability describes how far the data points in a data set lie from the average, or mean, of the data points.

It also shows how much the data points differ from one another. Variability may be assessed using measures of spread such as the range, variance, and standard deviation.

The range is a number that represents the difference between the data set's lowest and greatest values.
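The sketch below computes the range, variance, and standard deviation for a small made-up data set with the standard statistics module. It shows both the population versions (divide by n) and the sample versions (divide by n - 1), since which one applies depends on whether the data is the whole population or a sample.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set with mean 5

# Range: difference between the largest and smallest values.
data_range = max(data) - min(data)

# Population variance and standard deviation (divide by n).
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)

# Sample variance and standard deviation (divide by n - 1).
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print(data_range, population_variance, population_std)
```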

Skewness and Kurtosis

The skewness of a data set helps you figure out how symmetrical it is. If the data is spread symmetrically around the center, the curve is symmetric and the data is not skewed.

The data is negatively skewed if the longer tail of the curve extends to the left, and positively skewed if the longer tail extends to the right. This indicates that more of the data lies on one side of the central tendency measurements than on the other.

Kurtosis is a metric that describes the tails of a distribution. It tells you whether the data is light-tailed or heavy-tailed compared with a normal distribution, and plotting the data points on a graph can help you see this.
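One common way to quantify both measures is through central moments, sketched below. These are the simple population formulas without small-sample corrections, and the function names are hypothetical. Excess kurtosis subtracts 3 so that a normal distribution scores 0.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data set."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Third standardized moment: 0 for symmetric data, positive for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Fourth standardized moment minus 3: positive for heavy tails, negative for light tails."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric), excess_kurtosis(heavy_tailed))
```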

Statistical Inference

Inferential statistics are used to get insights into a data collection. Descriptive statistics give information about the data.
Inferential statistics is concerned with drawing conclusions about a big population from a small sample of data.

Assume you're trying to figure out how many individuals in Africa have received the polio vaccination.

This analysis can be carried out in two ways:

• Ask every person in Africa if they have received the polio vaccination.
• Take a sample of people from throughout the continent, make sure they come from various regions, then extrapolate the results to the entire continent.

The first procedure is difficult, if not impossible, to accomplish. It's not feasible to travel across the whole continent asking people if they've received the vaccination.

The second technique is preferable since it allows you to draw conclusions or insights from the sample you've chosen and extrapolate the results to a larger population.
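The sampling approach can be simulated in a few lines. Everything in this sketch is made up for illustration: the population size, the 70% vaccination rate, and the sample size are assumptions, not real figures.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: True means the person is vaccinated.
population = [True] * 70_000 + [False] * 30_000
true_rate = sum(population) / len(population)

# Ask only a random sample instead of the whole population.
sample = random.sample(population, 2_000)
estimated_rate = sum(sample) / len(sample)

print(true_rate, estimated_rate)
```

Even though only 2% of the simulated population is asked, the estimate from a random sample typically lands close to the true rate. In a real survey, the hard part is making the sample representative, which is why the sample should cover various regions.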

The following are some inferential statistics tools:

Central Limit Theorem

The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size grows, whatever the shape of the population's distribution, and that the average of the sample means approaches the average of the total population. This is why a sufficiently large random sample can be used to estimate population measures such as the mean.

This implies that as you draw larger samples, the distribution of their means gets closer to a normal curve.
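A small simulation can make this visible. The sketch below draws many samples from a deliberately skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen as an assumption for the example) and looks at how the sample means behave.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

sample_size = 50
num_samples = 2_000

# Mean of each sample drawn from a skewed (exponential) population.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# Theory: the spread of sample means is population std / sqrt(sample size).
expected_spread = 1.0 / sample_size ** 0.5

print(round(mean_of_means, 3), round(spread_of_means, 3), round(expected_spread, 3))
```

The sample means cluster around the population mean of 1, and their spread is close to the theoretical value, even though the underlying population is far from bell-shaped.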

If you wish to apply the central limit theorem, you must first grasp the idea of confidence intervals. A confidence interval gives a range of values that is likely to contain the population's mean value.

The interval is constructed by adding a margin of error to, and subtracting it from, the sample mean. One way to determine this margin of error is to multiply the standard error of the mean by the z-score for the chosen confidence level.
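The margin-of-error calculation can be sketched as follows. The function name confidence_interval and the data are hypothetical, and the z-based interval is an approximation: for small samples like this one, a t-based interval is usually more appropriate.

```python
import statistics


def confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the population mean using a z-based margin of error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    # z-score that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return mean - margin, mean + margin


sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]  # hypothetical measurements
low, high = confidence_interval(sample, confidence=0.95)
print(round(low, 2), round(high, 2))
```

Raising the confidence level widens the interval, because a larger z-score is needed to cover more of the distribution.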

Hypothesis Testing

Hypothesis testing is a procedure for checking whether an assumption you make about the data is supported by the evidence. Using this kind of testing, you can evaluate your hypothesis on a smaller sample of the population.

The null hypothesis is the default assumption that you test, and it is compared against the alternative hypothesis to see whether the data gives enough evidence to reject it.

Consider the following scenario: you're conducting a survey to find out who smokes and who doesn't, as well as how smokers are affected by cancer.

When conducting this survey, you make the assumption that the number of cancer patients who smoke is equal to the number of cancer patients who do not smoke. This is your null hypothesis, and the test determines whether the data gives you enough evidence to reject it.

The alternative hypothesis is that the number of cancer patients who smoke is higher than the number of cancer patients who do not. You may test hypotheses and evaluate data to see if the null hypothesis is valid or not using the data and evidence provided.
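One simple way to test the smoking example is a one-sided z-test for a proportion, sketched below. The counts are invented for illustration, the function name is hypothetical, and the normal approximation used here is an assumption that works best with reasonably large samples. The null hypothesis is that half of the cancer patients smoke; the alternative is that more than half do.

```python
from statistics import NormalDist


def one_sided_proportion_test(successes, n, null_proportion=0.5):
    """Return (z, p_value) for H0: p = null_proportion against H1: p > null_proportion."""
    observed = successes / n
    standard_error = (null_proportion * (1 - null_proportion) / n) ** 0.5
    z = (observed - null_proportion) / standard_error
    # Probability of a result at least this extreme if the null hypothesis is true.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical survey: 130 of 200 cancer patients are smokers.
z, p_value = one_sided_proportion_test(successes=130, n=200)
print(round(z, 2), p_value < 0.05)
```

A p-value below the chosen significance level, commonly 0.05, means you reject the null hypothesis in favor of the alternative. Note that this only shows an association in the sample; it does not by itself show how smoking affects cancer.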

ANOVA (Analysis of Variance)

Another statistical technique used to test hypotheses across several sets of data is ANOVA. It helps determine whether the groups under consideration have the same average.

With ANOVA, you can compare several groups in a single test, which keeps the error rate lower than running many separate pairwise tests. ANOVA is carried out by computing the F-ratio.

The F-ratio is the ratio of the mean square between groups to the mean square within groups.

The procedures for calculating ANOVA are as follows:

1. Write down the hypotheses. A null and an alternative hypothesis should be included in every study.

2. Under the null hypothesis, you assume that the averages of all the groups are the same.

3. Under the alternative hypothesis, the average of at least one group is different.
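Once the hypotheses are written down, the F-ratio for a one-way ANOVA can be computed as sketched below. The function name f_ratio and the three groups are hypothetical. Turning the F-ratio into a p-value requires the F distribution, which Python's standard library does not provide, so this sketch stops at the ratio itself.

```python
def f_ratio(groups):
    """One-way ANOVA F-ratio: mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical measurements for three groups
f_value = f_ratio(groups)
print(round(f_value, 2))
```

A large F-ratio means the group averages differ by much more than the variation inside the groups would explain, which is evidence against the null hypothesis. When all group averages are identical, the F-ratio is 0.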
