[Review] PageRank: Standing on the Shoulders of Giants

Jan 26, 2024

Contents

Introduction ~ Ranking Web Pages using PageRank
2. Web information retrieval
3. Ranking Web pages using PageRank
Computing the PageRank Vector ~ Hubs and Authorities on the Web
Computing the PageRank vector
Standing on the shoulders of giants
Bibliometrics ~ Conclusion
Conclusion

keywords: PageRank, Web information retrieval, Bibliometrics, Sociometry, Econometrics.

Introduction ~ Ranking Web Pages using PageRank

1. Introduction

PageRank is a Web page ranking technique that has been a fundamental ingredient in the development and success of the Google search engine.

→ 페이지랭크는 Google 검색 엔진의 개발과 성공에 기본 요소가 된 웹 페이지 순위 기술이다.

The method is still one of the many signals that Google uses to determine which pages are most important.

→ 이 방법은 여전히 Google에서 가장 중요한 페이지를 결정하는 데 사용하는 여러 신호 중 하나이다.

The main idea behind PageRank is to determine the importance of a Web page in terms of the importance assigned to the pages hyperlinking to it.

→ PageRank의 주요 아이디어는 해당 페이지로 하이퍼링크된 페이지들의 중요도에 따라 웹 페이지의 중요성을 결정하는 것이다.

In fact, this thesis is not new, and has been previously successfully exploited in different contexts.

→ 사실, 이 이론은 새로운 것이 아니고 이전부터 다른 문맥(연구?)으로부터 성공적으로 활용되었다.

We review the PageRank method and link it to some renowned previous techniques that we have found in the fields of Web information retrieval, bibliometrics, sociometry, and econometrics.

→ PageRank 방법을 검토하고 웹 정보 검색, 서지 측정, 사회 측정 및 계량 경제학 분야에서 발견한 몇 가지 유명한 이전에 알려진 기술과 연결한다.

→ 우리꺼 중요하다 는 것을 강조

2. Web information retrieval

💡

In 1945 Vannevar Bush wrote a today celebrated article in The Atlantic Monthly entitled “As We May Think” describing a futuristic device he called Memex.

→ 1945년에 Vannevar Bush는 "우리가 생각할 수 있는 것처럼"이라는 제목으로 The Atlantic Monthly 논문에 미래지향적인 장치인 Memex를 설명하는 글을 썼다.

→Bush는 Memex라는 장치를 통해 정보의 연결과 저장을 새로운 방식으로 상상하며, 현대 인터넷의 초기 아이디어 중 하나를 제시했다.

Bush writes: “Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the Memex and there amplified.”

→ Bush: “완전히 새로운 형태의 백과사전이 나타날 것이며 연관된 경로들의 Mesh (망 형태)로 준비되어 Memex에 투입되어 확장(증폭)될 것이다."

→ 웹상의 하이퍼링크와 유사한 개념으로 정보들이 상호 연결되어 쉽게 탐색될 수 있음을 예측한 것

Bush’s prediction came true in 1989, when Tim Berners-Lee proposed the Hypertext Markup Language (HTML) to keep track of experimental data at the European Organization for Nuclear Research (CERN).

→ Bush의 예측은 1989년에 Tim Berners-Lee가 유럽 핵 연구소(CERN)에서 실험 데이터를 추적하기 위해 하이퍼텍스트 마크업 언어 (HTML)를 제안하면서 실현되었다.

→ Berners-Lee의 HTML 제안은 현대 웹의 기초를 마련하고 정보를 상호 연결된 방식으로 저장하고 접근할 수 있는 시스템을 구축하는 데 중요한 역할을 했다.

In the original far-sighted proposal in which Berners-Lee attempts to persuade CERN management to adopt the new global hypertext system we can read the following paragraph:

→ Berners-Lee가 CERN 경영진을 새로운 글로벌 하이퍼텍스트 시스템을 적용하도록 설득하려는 원시안적인(far-sighted)한 제안서에서 다음과 같은 단락을 읽을 수 있다

→ Berners-Lee가 HTML의 적용에 대한 제안의 중요성을 강조하고 새로운 정보 시스템(HTML)의 잠재력을 설득하기 위해 노력했음을 나타낸다.

“We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities.

→ 우리는 화려한 그래픽 기술이나 복잡한 추가 기능보다 일반성과 이식성이 중요한 일반적인 연결된 정보시스템을 향해 나아가야 한다.

→ 정보의 접근성과 이동성을 강조하며, 복잡한 기술보다 사용자에게 유용한 정보의 연결과 공유에 초점을 맞추는 것의 중요성을 강조한다.

The aim would be to allow a place to be found for any information or reference which one felt was important, and a way of finding it afterwards.

→ 목표는 중요하다고 느끼는 어떠한 정보나 참고자료를 찾을 수 있는 장소를 제공하고, 그 후에 그것을 찾는 방법을 허용하는 것이다.

→ 웹의 기본 원칙 중 하나인 정보의 접근성과 검색 용이성에 대한 Berners-Lee의 비전을 나타낸다.

The result should be sufficiently attractive to use that the information contained would grow past a critical threshold.”

→ 결과적으로 사용하기에 충분히 매력적이어야 하며 그 안에 포함된 정보가 중요한 임계점을 넘어 성장해야 한다.

→ 이는 웹이 사용자에게 매력적이고 유용해야 하며 그로 인해 지속적으로 정보가 증가하고 발전해야 한다는 것을 의미한다.

As we all know, the proposal was accepted and later implemented in a mesh – this was the only name that Berners-Lee originally used to describe the Web – of interconnected documents that rapidly grew beyond the CERN threshold, as Berners-Lee anticipated, and became the World Wide Web.

→ 우리가 알다시피 이 제안은 나중에 받아들여져서 상호연결된 서류의 mesh(망)로 구현되었다.

mesh(망)이란 Berners-Lee가 원래 웹을 설명하기 위해 사용한 유일한 이름이다.

mesh는 CERN의 임계점을 넘어 빠르게 성장하여 Berners-Lee가 예상처럼 월드 와이드 웹이 되었다.

→ Berners-Lee가 제안한 웹이 처음으로 구현되었을 때 그 기본 구조와 Berners-Lee의 비전이 어떻게 현실이 되었는지를 설명한다.

Today, the Web is a huge, dynamic, self-organized, and hyperlinked data source, very different from traditional document collections which are nonlinked, mostly static, centrally collected and organized by specialists.

→ 오늘날, 웹은 거대하고 동적이며, 자가 조직화되고 하이퍼링크된 데이터 소스로,

비연결되고 대부분 정적이며, 중앙 집중식으로 전문가에 의해 수집 및 조직된 전통적인 문서 컬렉션과는 매우 다르다.

→ 오늘날 웹이 기존의 정적이고 중앙 집중식 문서 컬렉션과는 다르게 독특한 특성을 가지고 있으며, 특성이 웹 정보 검색을 전통적인 정보 검색과 다르게 만든다.

These features make Web information retrieval quite different from traditional information retrieval and call for new search abilities, like automatic crawling and indexing of the Web.

→이러한 특징들은 웹 정보 검색을 전통적인 정보 검색과 상당히 다르게 만들고, 웹의 자동 크롤링 및 색인화와 같은 새로운 검색 능력을 요구한다.

→ 웹의 거대하고 동적이고 자가조직화되고 하이퍼링크된 데이터 소스로 연결된 구조는 전통적인 검색 방식으로는 다루기 어렵기 때문에 웹 크롤링 및 색인화와 같은 새로운 기술이 필요하다.

Moreover, early search engines ranked responses using only a content score, which measures the similarity between the page and the query.

→ 또한 이전 검색 엔진들은 페이지와 검색어 사이의 유사성을 측정하는 내용 점수만을 사용하여 응답을 순위 매겼다.

→ 이전 검색 엔진들은 주로 페이지 내용이 사용자의 검색어와 얼마나 일치하는지를 바탕으로 페이지의 순위를 결정

One simple example is just a count of the number of times the query words occur on the page, or perhaps a weighted count with more weight on title words.

→ 한 간단한 예제는 페이지에서 검색어가 나타나는 횟수를 세거나 아마도 제목에 있는 단어에 더 많은 가중치를 두어 세는 것이다

→ 이전 검색 엔진이 사용한 기본적인 방법으로 페이지의 내용과 사용자 쿼리의 일치도를 평가하는 방식을 나타낸다.
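The weighted count described above can be sketched in a few lines. This is only an illustration of the idea, not the scoring function of any real search engine: the function name `content_score`, the `title_weight` value and the sample page are all assumptions made for the example.

```python
def content_score(query, title, body, title_weight=3):
    """Weighted count of query-word occurrences; title hits weigh more.

    title_weight=3 is an arbitrary illustrative choice.
    """
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0
    for term in terms:
        score += title_weight * title_words.count(term)  # hits in the title
        score += body_words.count(term)                  # hits in the body
    return score


# Hypothetical page used only for the example.
page_title = "PageRank explained"
page_body = "pagerank ranks pages by links and pagerank ignores the query"
score = content_score("pagerank links", page_title, page_body)
print(score)
```

Because the score depends only on the words on the page, a spammer can inflate it by repeating query words, which is the weakness the text attributes to query-dependent techniques.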

These traditional query-dependent techniques suffered under the gigantic size of the Web and the death grip of spammers.

→ 쿼리 의존적인 전통적인 기술은 웹의 거대한 크기와 스패머들의 영향으로 어려움을 겪었다.

→ 전통적인 검색 기술은 웹의 규모와 스팸 콘텐츠의 증가로 인해 효율성과 정확성에 문제를 겪었다.

In 1998, Sergey Brin and Larry Page revolutionised the field of Web information retrieval by introducing the notion of an importance score, which gauges the status of a page, independently from the user query, by analysing the topology of the Web graph.

→ 1998년에 Sergey Brin과 Larry Page는 웹 그래프의 구조(토폴로지)를 분석하여 사용자 쿼리와 독립적으로 페이지의 상태를 측정하는 중요도 점수 개념을 도입함으로써 웹 정보 검색 분야에 혁명을 일으켰다.

→ 이는 PageRank 알고리즘의 핵심 개념으로, 페이지의 중요성을 결정하는 새로운 방법을 제시하며 웹 검색의 패러다임을 바꿨다.

The method was implemented in the famous PageRank algorithm and both the traditional content score and the new importance score were efficiently combined in a new search engine named Google.

→ By combining the traditional content-based score, which measures the similarity between page and query, with the PageRank importance score, Google broke new ground as a search engine.

3. Ranking Web pages using PageRank

We briefly recall how the PageRank method works keeping the mathematical machinery to the minimum.

→ 어떻게 PageRank 방법이 수학적 메커니즘을 최소화하면서 작동하는지 간략하게 상기한다.

Interested readers can more thoroughly investigate the topic in a recent book of Langville and Meyer which elegantly describes the science of search engine rankings in a rigorous yet playful style.

→ 관심있는 독자들은 Langville과 Meyer의 최근 책에서 이 주제를 자세하게 볼 수 있다. 이 저자들은 검색 엔진 순위에 대해서 유쾌하고 엄격한 스타일로 우아하게 설명해준다.

We start by providing an intuitive interpretation of PageRank in terms of random walks on graphs.

→ 그래프의 무작위 도보와 관련하여 PageRank에 대한 직관적인 해석을 제공한다.

The Web is viewed as a directed graph of pages connected by hyperlinks.

→ 웹은 하이퍼링크로 연결된 페이지의 방향성 그래프로 표시된다

A random surfer starts from an arbitrary page and simply keeps clicking on successive links at random, bouncing from page to page.

→ 무작위 서퍼는 임의의 페이지에서 시작하여 무작위로 연속적인 링크를 클릭하며 페이지에서 페이지로 이동한다.

→ PageRank 알고리즘에서 무작위 서퍼의 행동을 통해 페이지의 중요성을 파악한다.

The PageRank value of a page corresponds to the relative frequency that the random surfer visits that page, assuming that the surfer goes on infinitely.

→ 페이지의 PageRank 값은 무작위 서퍼가 해당 페이지를 방문하는 상대적 빈도에 대응되며, 서퍼가 무한히 계속 간다고 가정한다.

→ 페이지가 무작위 서퍼에 의해 방문될 확률이 높을수록 그 페이지의 PageRank 값이 높아진다.

The more time spent by the random surfer on a page, the higher the PageRank importance of the page.

→ 무작위 서퍼가 페이지에 머무는 시간이 길수록 해당 페이지의 PageRank 중요도가 높아진다.

→ 페이지가 사용자에게 유용하고 관심을 끌수록 PageRank에서 높은 평가를 받는다.

A little more formally, the method can be described as follows.

→ 보다 더 공식적으로, 이 방법은 아래의 과정으로 설명할 수 있다.

→ PageRank 계산 방법에 대한 보다 구체적인 수학적 설명을 시작

Let us denote by $q_i$ the number of distinct outgoing (hyper)links of page $i$.

→ This defines the variable that counts the outgoing links of each Web page.

Let $H = (h_{i,j})$ be a square matrix of size equal to the number $n$ of Web pages such that $h_{i,j} = 1/q_i$ if there exists a link from page $i$ to page $j$ and $h_{i,j} = 0$ otherwise.

→ This defines the matrix $H$ that represents the link structure among Web pages: an entry holds the corresponding probability if a link exists and 0 otherwise.
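As a minimal sketch of this definition, the matrix $H$ can be built from a table of outgoing links. The function name `hyperlink_matrix` and the three-page toy graph are assumptions made for illustration; they do not come from the paper.

```python
def hyperlink_matrix(links, n):
    """Build H with h[i][j] = 1/q_i if page i links to page j, else 0."""
    H = [[0.0] * n for _ in range(n)]
    for i, targets in links.items():
        distinct = set(targets)  # q_i counts distinct outgoing links only
        for j in distinct:
            H[i][j] = 1.0 / len(distinct)
    return H


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 has no outgoing links.
toy_links = {0: [1, 2], 1: [2], 2: []}
H = hyperlink_matrix(toy_links, 3)
for row in H:
    print(row)
```

Rows of pages with at least one link sum to 1, while the row of page 2 is all zeros.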

The value $h_{i,j}$ can be interpreted as the probability that the random surfer moves from page $i$ to page $j$ by clicking on one of the distinct links of page $i$.

→ Each entry of matrix $H$ is thus the probability of moving from one particular page to another.

The PageRank $\pi_j$ of page $j$ is recursively defined as

$$\pi_j = \sum_i \pi_i h_{i,j}$$

or, in matrix notation,

$$\pi = \pi H$$

Hence, the PageRank of page $j$ is the sum of the PageRank scores of pages $i$ linking to $j$, weighted by the probability of going from $i$ to $j$.

→ The PageRank of a page is determined by the PageRank values of the pages that link to it, with each link weighted by its transition probability.

In words, the PageRank thesis reads as follows:

→ 말로 표현하면, PageRank 논리는 다음과 같다:

A Web page is important if it is pointed to by other important pages.

→ 웹 페이지는 다른 중요한 페이지들에 의해 가리켜질 경우 중요하다.

—> 웹 페이지의 중요성은 다른 중요한 페이지들에 의해 링크되는 정도에 의해 결정된다

There are in fact three distinct factors that determine the PageRank of a page: (i) the number of links it receives, (ii) the link propensity, that is, the number of outgoing links, of the linking pages, and (iii) the PageRank of the linking pages.

→ 실제로 페이지의 PageRank를 결정하는 세 가지 구별되는 요소가 있다.

(i) 페이지가 받는 링크의 수

(ii) 링크하는 페이지의 링크 경향, 즉 나가는 링크의 수

(iii) 링크하는 페이지들의 PageRank

→ 페이지의 받는 링크의 수, 링크를 보내는 페이지들의 나가는 링크 수, 그리고 그 링크하는 페이지들의 PageRank에 의해 영향을 받는다.

The first factor is not surprising: the more links a page receives, the more important it is perceived.

→ 첫 번째 요소는 놀랍지 않다: 페이지가 받는 링크가 많을수록, 그 페이지는 더 중요하게 인식된다. → 많은 수의 링크를 받으면 그 페이지가 더 중요하다고 간주되는 것은 일반적인 관찰이다.

Reasonably, the link value depreciates proportionally to the number of links given out by a page: endorsements coming from parsimonious pages are worthier than those emanated by spendthrift ones.

→ 합리적으로, 페이지에서 나가는 링크 수에 비례하여 링크 가치는 감소한다:

검소한 페이지로부터 나오는 추천은 낭비하는 페이지로부터 나오는 추천보다 가치가 있다. → 페이지가 많은 링크를 보내는 경우, 그 페이지의 링크 하나하나의 가치는 상대적으로 낮아진다는 것을 의미

Finally, not all pages are created equal: links from important pages are more valuable than those from obscure ones.

→ 마지막으로, 모든 페이지가 동일하게 생성되는 것은 아니다. 중요한 페이지로부터의 링크는 덜 알려진 페이지로부터의 링크보다 더 가치가 있다.

Unfortunately, this ideal model has two problems that prevent the solution of the system.

→ 불행히도, 이 이상적인 모델에는 시스템의 해결을 방해하는 두 가지 문제가 있다.

The first one is due to the presence of dangling nodes, that are pages with no forward links.

→ 첫 번째 문제는 전방 링크가 없는 페이지인 dangling nodes의 존재 때문이다.

→ 링크를 다른 페이지로 전혀 보내지 않는 페이지들이 PageRank 계산에서 문제를 일으킬 수 있음

These pages capture the random surfer indefinitely.

→ 이러한 페이지들은 무작위 서퍼를 무기한으로 붙잡는다.

→ 즉 무작위 서퍼 모델에서, 링크가 없는 페이지는 서퍼가 다른 페이지로 이동할 수 없게 만들어 계산에 문제를 발생

Notice that a dangling node corresponds to a row in matrix

H

with all entries equal to 0.

→ dangling 노드는 행렬 $H$ 의 모든 항목이 0인 행에 해당한다는 점에 유의

→ dangling 노드는 행렬 $H$ 에서 링크가 없어 0으로만 이루어진 행을 나타내어 계산에 문제

To tackle the problem of dangling nodes, the corresponding rows in $H$ are replaced by the uniform probability vector $u = \frac{1}{n} e$, where $e$ is a vector of length $n$ with all components equal to 1.

→ When a page is dangling, the surfer is allowed to move from it to an arbitrary page, which repairs the PageRank computation.

Alternatively, one may use any fixed probability vector in place of $u$.

→ Using a different probability vector is one way of tuning the PageRank algorithm.

This means that the random surfer escapes from the dangling page by jumping to a randomly chosen page.

→ 이것은 무작위 서퍼가 dangling 페이지에서 탈출하여 무작위로 선택된 페이지로 점프한다는 것을 의미한다.

→ 노드 문제 해결을 위해, 무작위 서퍼가 다른 페이지로 무작위로 이동할 수 있도록 함으로써 계산을 보정

We call $S$ the resulting matrix.
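The dangling-node patch can be sketched as follows, assuming $H$ is stored as a list of rows. The function name `patch_dangling` and the toy matrix are illustrative assumptions.

```python
def patch_dangling(H):
    """Return S: every all-zero row of H is replaced by the uniform vector u."""
    n = len(H)
    u = [1.0 / n] * n
    return [row[:] if any(row) else u[:] for row in H]


# Hypothetical 3-page hyperlink matrix; page 2 is a dangling node.
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
S = patch_dangling(H)
for row in S:
    print(row)
```

After the patch every row sums to 1, so $S$ is a stochastic matrix.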

The second problem with the ideal model is that the surfer can get trapped into a bucket of the Web graph, which is a reachable strongly connected component without outgoing edges towards the rest of the graph.

→ 두 번째 문제점은 서퍼가 웹 그래프의 버킷에 갇힐 수 있다는 것인데, 이는 그래프의 나머지 부분으로 나가는 가장자리가 없이 강하게 연결된 구성 요소로, 도달할 수 있습니다.

The solution proposed by Brin and Page is to replace matrix $S$ by the Google matrix

$$G = \alpha S + (1 - \alpha) E$$

where $E$ is the teleportation matrix with identical rows each equal to the uniform probability vector $u$, and $\alpha$ is a free parameter of the algorithm often called the damping factor.

Alternatively, a fixed personalization probability vector $v$ can be used in place of $u$.

→ The Google matrix $G$ replaces $S$; $E$ is the teleportation matrix, $\alpha$ the damping factor, and $v$ an optional personalization vector.

In particular, the personalization vector can be exploited to bias the result of the method towards certain topics.

→ 특히 개인화 벡터를 활용하여 방법의 결과를 특정 주제에 편향시킬 수 있습니다.

The interpretation of the new system is that, with probability $\alpha$ the random surfer moves forward by following links, and, with the complementary probability $1 - \alpha$ the surfer gets bored of following links and enters a new destination in the browser's URL line, possibly unrelated to the current page.

→ In short: follow a link with probability $\alpha$, teleport with probability $1 - \alpha$.

The surfer is hence teleported, like a Star Trek character, to that page, even if there exists no link connecting the current and the destination pages in the Web universe.

→ 따라서 웹 세계에서 현재 페이지와 대상 페이지를 연결하는 링크가 없더라도 서퍼는 Star Trek 캐릭터처럼 해당 페이지로 순간 이동됩니다.

The inventors of PageRank propose to set the damping factor $\alpha = 0.85$, meaning that after about five link clicks the random surfer chooses a random page.

→ With $\alpha = 0.85$ the surfer teleports with probability 0.15 at each step.
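A small sketch of the Google matrix $G = \alpha S + (1 - \alpha) E$ with uniform teleportation may help. The function name `google_matrix` and the toy matrix $S$ are assumptions for illustration; a dense $G$ like this is only practical for tiny graphs.

```python
def google_matrix(S, alpha=0.85):
    """Return G = alpha*S + (1-alpha)*E, where every entry of E equals 1/n."""
    n = len(S)
    teleport = (1.0 - alpha) / n
    return [[alpha * S[i][j] + teleport for j in range(n)] for i in range(n)]


# Hypothetical stochastic matrix S (dangling rows already patched).
S = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1 / 3, 1 / 3, 1 / 3]]
G = google_matrix(S)
for row in G:
    print([round(value, 4) for value in row])
```

Every entry of $G$ is strictly positive, which is what makes the graph of $G$ strongly connected.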

The PageRank vector is then defined as the solution of equation:

$$\pi = \pi G \qquad (1)$$

Figure 1: A PageRank instance with solution. Each node is labelled with its PageRank score. Scores have been normalized to sum to 100. We assumed $\alpha = 0.85$.

An example is provided in Figure 1.

Node A is a dangling node, while nodes B and C form a bucket.

Notice the dynamics of the method: page C receives just one link but from the most important page B; its importance is much higher than that of page E, which receives many more links, but from anonymous pages.

그림 1에 예시가 나와 있습니다. 노드 A는 댕글링 노드이고 노드 B와 C는 버킷을 형성합니다. 이 방법의 역학 관계에 주목하세요. 페이지 C는 단 하나의 링크만 받지만 가장 중요한 페이지 B로부터 링크를 받으며, 그 중요성은 더 많은 링크를 받지만 익명 페이지에서 받는 페이지 E보다 훨씬 높습니다.

Pages G, H, I, L, and M do not receive endorsements; their scores correspond to the minimum amount of status of each page.

페이지 G, H, I, L, M은 추천을 받지 못하며, 점수는 각 페이지의 최소 상태와 일치합니다.

Typically, the normalization condition $\sum_i \pi_i = 1$ is also added. In this case Equation (1) becomes

$$\pi = \alpha \pi S + (1 - \alpha) u$$

The latter distinguishes two factors contributing to the PageRank vector: an endogenous factor equal to $\pi S$, which takes into consideration the real topology of the Web graph, and an exogenous factor equal to the uniform probability vector $u$, which can be interpreted as a minimal amount of status assigned to each page independently of the hyperlink graph. The parameter $\alpha$ balances between these two factors.

In other words, once $\sum_i \pi_i = 1$ is imposed, the PageRank vector splits into a part $\pi S$ that follows the actual link structure and a part $u$ that gives every page a minimal status, with $\alpha$ setting the balance.

Computing the PageRank Vector ~ Hubs and Authorities on the Web

Computing the PageRank vector

Does Equation 1 have a solution? Is the solution unique? Can we efficiently compute it? The success of the PageRank method rests on the answers to these queries. Luckily, all these questions have nice answers.

→

1) 수식 1에 해가 있는가?

2) 해가 고유한가(=유일한가)?

3) 효율적으로 계산할 수 있는가?

PageRank 방법의 성공은 3개의 질문이 남아있다.

운이 좋게도 모든 질문에는 좋은 답이 있다.

Thanks to the dangling nodes patch, matrix $S$ is a stochastic matrix, and clearly the teleportation matrix $E$ is also stochastic. It follows that $G$ is stochastic as well, since it is defined as a convex combination of stochastic matrices $S$ and $E$. It is easy to show that, if $G$ is stochastic, Equation 1 has always at least one solution. Hence, we have got at least one PageRank vector. Having two independent PageRank vectors, however, would be already too much: which one should we use to rank Web pages? Here, a fundamental result of algebra comes to the rescue: the Perron-Frobenius theorem.

It states that, if $A$ is an irreducible nonnegative square matrix, then there exists a unique vector $x$, called the Perron vector, such that $xA = rx$, $x > 0$, and $\sum_i x_i = 1$, where $r$ is the maximum eigenvalue of $A$ in absolute value, that algebraists call the spectral radius of $A$. The Perron vector is the left dominant eigenvector of $A$, that is, the left eigenvector associated with the largest eigenvalue in magnitude.

→ Thanks to the dangling-node patch, matrix $S$ is stochastic, and so is the teleportation matrix $E$. Hence $G$, defined as a convex combination of stochastic matrices, is stochastic too.

💡

A stochastic matrix is one whose entries all lie between 0 and 1 and whose rows each sum to 1. The "dangling nodes patch" handles pages that have no outgoing links, and it is what makes $S$, and in turn $G$, stochastic. A convex combination weights each matrix by a nonnegative weight, with the weights summing to 1; a convex combination of two stochastic matrices is again a stochastic matrix.

→ 만약 G가 확률 행렬이라면, 수식 1은 항상 최소한 하나의 해를 가진다는 것을 쉽게 보일 수 있다. → 확률적 행렬(==마르코프 행렬==전이행렬)의 기본 속성

💡

마르코프 행렬(==확률적 행렬 == 전이행렬)이란?

[자세히보기]

[출처: https://twlab.tistory.com/53]

확률을 이용하여 어떤 객체 상태를 시간에 따라 어떻게 변화할지를 모델링(modeling)하는 것

보다 자세히 말하자면 다음 상태가 오직 현재 상태에만 의존하는 확률 과정을 모델링

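Properties of a Markov (stochastic) matrix

All entries are greater than or equal to 0.

→ Entries are probabilities, so negative values are not allowed.

Each row sums to 1.

→ Under the column convention, each column sums to 1 instead.

The eigenvector for eigenvalue 1 → steady state.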

→ 따라서 최소 한 개 이상의 PageRank vector를 갖는다. 그러나 두 개의 독립적인 PageRank 벡터를 가지는 것은 과하다: 우리는 어떤 것을 웹 페이지 순위 매기기에 사용해야 하는가?

💡

PageRank 벡터의 유일성이 중요하다는 점을 강조. 만약 두 개 이상의 PageRank 벡터가 존재한다면, 어떤 벡터를 웹 페이지의 순위 결정에 사용해야 할지 명확X

→ 여기서 대수학의 기본적인 결과가 해결책을 제공한다: 페론-프로베니우스 정리.

💡

페론-프로베니우스 정리는 비음수 행렬에 대한 중요한 수학적 정리. PageRank와 같은 문제에서 유일한 해의 존재를 보장하는 데 중요한 역할을 함

→ The theorem states that if $A$ is an irreducible nonnegative square matrix, there exists a unique vector $x$ (the Perron vector) with $xA = rx$, $x > 0$ and $\sum_i x_i = 1$, where $r$ is the eigenvalue of $A$ of maximum absolute value, which algebraists call the spectral radius of $A$.

💡

This spells out the content of the Perron-Frobenius theorem: a matrix satisfying these conditions has a unique eigenvector (the Perron vector) whose entries are all positive and sum to 1. This plays a key role in the PageRank algorithm.

→ The Perron vector is the left dominant eigenvector of $A$, that is, the left eigenvector associated with the largest eigenvalue in magnitude.

💡

페론 벡터가 고유값 문제에서 어떻게 위치하는지를 설명. 우세 고유벡터는 해당 행렬의 가장 큰 고유값에 대응하는 고유벡터를 의미. 이 벡터는 PageRank 알고리즘에서 웹 페이지의 중요도를 나타내는 데 사용 됨.

The matrix $S$ is most likely reducible, since experiments have shown that the Web has a bow-tie structure fragmented into four main continents that are not mutually reachable. Thanks to the teleportation trick, however, the graph of matrix $G$ is strongly connected. Hence $G$ is irreducible and the Perron-Frobenius theorem applies. Therefore, a positive PageRank vector exists and is furthermore unique.

→ Matrix $S$ is most likely not irreducible (the patch resolves dangling nodes only, not the bucket problem), because experiments have shown that the Web has a bow-tie structure made of four main continents that are not mutually reachable.

💡

웹의 구조가 '나비넥타이 모델'로 불리는 특정한 형태를 가지고 있음. 모델은 웹을 서로 도달할 수 없는 여러 부분으로 나뉨.

→ 그러나 이동(teleportation) 트릭 덕분에, 행렬 $G$ 의 그래프는 강하게 연결되어 있다.

💡

이동(teleportation) 트릭'이란 PageRank 알고리즘에서 사용되는 기법으로, 모든 페이지가 무작위로 다른 페이지로 '이동'할 수 있는 가정을 도입함으로써, 그래프의 모든 노드가 서로 도달 가능하도록 조작. 이로 인해 행렬 G의 그래프는 강하게 연결된(strongly connected) 상태.

→ 따라서 G는 기약이며, 페론-프로베니우스 정리가 적용된다.

💡

강하게 연결된 그래프는 기약 (irreducible) 행렬로 표현. 이는 행렬 G가 그래프의 모든 노드 간에 경로가 존재함을 의미하며, 따라서 페론-프로베니우스 정리를 적용 가능. 이 정리는 기약 비음수 행렬에 대해 유일한 비음수 고유 벡터의 존재를 보장.

→ 따라서, 양의 PageRank 벡터가 존재하며 또한 유일하다.

💡

페론-프로베니우스 정리에 따라, G 행렬은 유일한 양의 고유 벡터를 가진다. 이 고유 벡터는 웹 페이지들의 PageRank 점수를 나타내며, 이는 각 웹 페이지의 상대적 중요도를 나타내는 유일한 지표가 된다. 이러한 유일성은 웹 페이지 순위 결정에 있어 일관성과 정확성을 제공한다.

Interestingly, we can arrive at the same result using Markov theory. The above described random walk on the Web graph, modified with the teleportation jumps, naturally induces a finite-state Markov chain, whose transition matrix is the stochastic matrix $G$. Since $G$ is irreducible, the chain has a unique stationary distribution corresponding to the PageRank vector.

→ 흥미롭게도, 마르코프 이론을 사용하여 같은 결과에 도달할 수 있다.

→ 위에서 설명된 웹 그래프 상의 랜덤 워크는, 이동(teleportation) 점프로 수정되어 자연스럽게 유한 상태 마르코프 체인을 유도하며, 이 체인의 전이 행렬은 확률 행렬 G이다.

💡

웹 페이지 간의 링크를 따라 무작위로 이동하는 과정(랜덤 워크)이 이동 점프(teleportation)를 포함하여 수정됨으로써, 이는 유한 상태 마르코프 체인을 형성.여기서 '전이 행렬'은 마르코프 체인의 각 상태 간 전이 확률을 나타내는 행렬이며, 이 경우에는 확률 행렬 G가 전이행렬이 됨.

→ G가 기약이기 때문에, 이 체인은 PageRank 벡터에 해당하는 유일한 정적 분포를 가진다.

💡

마르코프 체인에서, '정적 분포(stationary distribution)'는 시간이 지남에 따라 변하지 않는 체인(전이행렬)의 상태 분포를 의미. 행렬 G가 기약이라는 사실은, 마르코프 체인이 유일한 정적 분포를 가질 것임을 보장. 이 정적 분포는 바로 PageRank 벡터와 일치하며, 웹 페이지의 중요도를 나타냄.

A last crucial question remains: can we efficiently compute the PageRank vector? The success of PageRank is largely due to the existence of a fast method to compute its values: the power method, a simple iteration method to find the dominant eigenpair of a matrix developed by von Mises and Pollaczek-Geiringer. It works as follows on the Google matrix $G$. Let $\pi^{(0)} = u = \frac{1}{n} e$. Repeatedly compute $\pi^{(k+1)} = \pi^{(k)} G$ until $\|\pi^{(k+1)} - \pi^{(k)}\| < \epsilon$, where $\|\cdot\|$ measures the distance between the two successive PageRank vectors and $\epsilon$ is the desired precision.

→ 마지막으로 중요한 질문이 남아있다: 효율적으로 PageRank vector를 계산할 수 있는가?

💡

이론적으로는 PageRank 벡터가 존재하고 유일하지만, 실제로 이를 효율적으로 계산하는 것이 가능한지에 대한 질문

→ PageRank의 성공은 그것의 값을 계산하기 위한 빠른 방법이 존재하기 때문이다: 파워 방법은, von Mises 와 Pollaczek-Geiringer에 의해 개발된 행렬의 우세한 고유쌍(고유값과 그에 해당하는 고유벡터)을 찾기 위한 간단한 반복 방법이다.

→ 이것은 구글 행렬 G에 다음과 같이 작동한다.

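→ Set $\pi^{(0)} = u = \frac{1}{n} e$ and repeatedly compute $\pi^{(k+1)} = \pi^{(k)} G$ until $\|\pi^{(k+1)} - \pi^{(k)}\| < \epsilon$. Here $\|\cdot\|$ measures the distance between two successive PageRank vectors and $\epsilon$ is the desired precision.

💡

Starting from the initial vector $\pi^{(0)}$, the vector is updated repeatedly using the Google matrix $G$. The process continues until the distance between two successive PageRank vectors falls below the threshold $\epsilon$, which sets the precision of the computation. Through this iteration the power method approximates the PageRank vector efficiently.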
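The iteration above can be sketched as follows for a small dense $G$. The function name `power_method`, the use of the L1 distance for $\|\cdot\|$ and the three-page example are assumptions chosen for illustration; the text does not fix a particular norm.

```python
def power_method(G, eps=1e-10, max_iter=1000):
    """Iterate pi <- pi G from the uniform vector until successive vectors agree."""
    n = len(G)
    pi = [1.0 / n] * n  # pi^(0) = u
    for k in range(1, max_iter + 1):
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        distance = sum(abs(a - b) for a, b in zip(nxt, pi))  # L1 distance
        pi = nxt
        if distance < eps:
            return pi, k
    return pi, max_iter


# Hypothetical 3-page example: S is stochastic, G adds uniform teleportation.
alpha = 0.85
S = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1 / 3, 1 / 3, 1 / 3]]
n = len(S)
G = [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]
pi, iterations = power_method(G)
print([round(p, 4) for p in pi], iterations)
```

In this toy graph page 2 receives links from both other pages and ends up with the highest score.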

The convergence rate of the power method is approximately the rate at which $\alpha^k$ approaches 0: the closer $\alpha$ is to unity, the lower the convergence speed of the power method. If, for instance, $\alpha = 0.85$, 43 iterations are sufficient to gain 3 digits of accuracy, and 142 iterations are enough for 10 digits of accuracy.

Notice that the power method applied to matrix $G$ can be easily expressed in terms of matrix $H$, which, unlike $G$, is a very sparse matrix that can be stored using a linear amount of memory with respect to the size of the Web.

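→ The power method converges at roughly the rate at which $\alpha^k$ tends to 0: the closer $\alpha$ is to 1, the slower the convergence.

💡

The convergence speed of the power method depends on the value of $\alpha$. The closer $\alpha$ is to 1, the more iterations the power method needs and the longer it takes to converge.

→ For example, with $\alpha = 0.85$, 43 iterations suffice for 3 digits of accuracy and 142 iterations suffice for 10 digits.

💡

With $\alpha = 0.85$, good accuracy is reached in relatively few iterations, which shows that the power method is an efficient way to compute the PageRank vector in practice.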
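The iteration counts quoted above can be checked with a short calculation, under the simplifying assumption that the error after $k$ iterations is about $\alpha^k$, so that $d$ digits require the smallest $k$ with $\alpha^k < 10^{-d}$. The function name `iterations_for_digits` is illustrative.

```python
import math


def iterations_for_digits(alpha, digits):
    """Smallest k such that alpha**k < 10**(-digits)."""
    k = math.ceil(digits * math.log(10) / -math.log(alpha))
    # Guard against rounding at the boundary.
    while alpha ** k >= 10.0 ** (-digits):
        k += 1
    return k


print(iterations_for_digits(0.85, 3))
print(iterations_for_digits(0.85, 10))
```

Raising $\alpha$ toward 1 makes the required count grow quickly, which is the trade-off the text describes.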

→ The power method applied to matrix $G$ can be expressed in terms of matrix $H$, which unlike $G$ is a very sparse matrix and can be stored using memory linear in the size of the Web.

💡

This sentence explains the benefit of working with matrix $H$ instead of matrix $G$. $H$ is a sparse matrix, meaning most of its entries are 0, so it is cheap to store. This helps to keep memory use under control for data as large as the Web: with $H$, memory grows only in proportion to the size of the Web.
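One way to express the iteration through the sparse structure of $H$ is sketched below. It relies on the identity $\pi G = \alpha \pi H + (\alpha\, \pi d + 1 - \alpha)\, u$, where $d$ marks the dangling pages and $\sum_i \pi_i = 1$; this follows from the definitions of $S$ and $G$ with uniform $u$. The function name `pagerank_sparse` and the adjacency-list format are assumptions, not the paper's implementation.

```python
def pagerank_sparse(out_links, n, alpha=0.85, eps=1e-10, max_iter=1000):
    """Power method that touches only the nonzero entries of H."""
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [0.0] * n
        dangling_mass = 0.0
        for i in range(n):
            targets = set(out_links.get(i, ()))
            if not targets:
                dangling_mass += pi[i]  # rank held by dangling pages
                continue
            share = alpha * pi[i] / len(targets)
            for j in targets:
                nxt[j] += share  # contribution alpha * pi_i * h_ij
        # Dangling rows and teleportation both spread rank uniformly.
        jump = (alpha * dangling_mass + 1.0 - alpha) / n
        nxt = [value + jump for value in nxt]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < eps:
            return nxt
        pi = nxt
    return pi


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling.
toy_links = {0: [1, 2], 1: [2], 2: []}
ranks = pagerank_sparse(toy_links, 3)
print([round(r, 4) for r in ranks])
```

Each pass costs time proportional to the number of links rather than to $n^2$, which is the point of avoiding the dense matrix $G$.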

Standing on the shoulders of giants

Dwarfs standing on the shoulders of giants is a Western metaphor meaning “One who develops future intellectual pursuits by understanding the research and works created by notable thinkers of the past”. The metaphor was famously uttered by Isaac Newton: “If I have seen a little further it is by standing on the shoulders of Giants”. Moreover, “Stand on the shoulders of giants” is Google Scholar’s motto: “the phrase is our acknowledgement that much of scholarly research involves building on what others have already discovered”.

→ "거인의 어깨 위에 서다"라는 서양의 비유는 "과거의 주목할 만한 사상가들이 만들어낸 연구와 작품을 이해함으로써 미래의 지적 탐구를 개발하는 사람"을 의미하는 말이다.

→ 이 비유는 아이작 뉴턴에 의해 유명하게 사용되었다. "내가 조금 더 멀리 볼 수 있었다면 그것은 거인의 어깨 위에 서 있었기 때문이다"라고 말했다.

→ 또한, "거인의 어깨 위에 서다"는 구글 스칼라(Google Scholar)의 모토입니다. 이 문구는 "학술 연구의 많은 부분이 이미 발견한 것들을 바탕으로 구축하는 것임을 인정하는 것"이라고 설명한다.

There are many giants upon whose shoulders PageRank firmly stands: Markov, Perron, Frobenius, von Mises and Pollaczek-Geiringer provided at the beginning of the 1900’s the necessary mathematical machinery to investigate and effectively solve the PageRank problem. Moreover, the circular PageRank thesis has been previously exploited in different contexts, including Web information retrieval, bibliometrics, sociometry, and econometrics. In the following, we review these contributions and link them to the PageRank method. Table 1 contains a brief summary of PageRank history. All the ranking techniques surveyed in this paper have been implemented in R and the code is freely available at the author’s Web page.

→ PageRank 알고리즘은 여러 학문적 거인들의 업적 위에 서 있다: Markov, Perron, Frobenius, von Mises and Pollaczek-Geiringer 는 1900년대 초반에 PageRank 문제를 조사하고 효과적으로 해결하기 위한 필요한 수학적 도구를 제공했다.

→ 또한, 순환적인 PageRank 논리는 웹 정보 검색, 문헌 계량학(bibliometrics), 사회 계량학(sociometry), 경제 계량학(econometrics) 등 다양한 맥락에서 이전에 이미 활용되었다.

→ 다음 파트에서 이러한 기여들을 검토하고 PageRank 방법에 연결시키겠다. 표 1은 PageRank 역사의 간략한 요약을 담고 있다. 논문에서 조사된 모든 순위 결정 기술들은 R 언어로 구현되었으며 코드는 저자의 웹 페이지에서 무료로 이용할 수 있다.

Year	Author	Contribution
1906	Markov	Markov theory [20]
1907	Perron	Perron theorem [24]
1912	Frobenius	Perron-Frobenius theorem [7]
1929	von Mises & Pollaczek-Geiringer	Power method [31]
1941	Leontief	Econometric model [18]
1949	Seeley	Sociometric model [29]
1952	Wei	Sport ranking model [32]
1953	Katz	Sociometric model [11]
1965	Hubbell	Sociometric model [10]
1976	Pinski & Narin	Bibliometric model [26]
1998	Kleinberg	HITS [14]
1998	Brin & Page	PageRank [4]

Table 1:PageRank history.

Hubs and authorities on the Web

Hypertext Induced Topic Search (HITS) is a Web page ranking method proposed by Jon Kleinberg. The connections between HITS and PageRank are striking. Despite the close conceptual, temporal and even geographical proximity of the two approaches, it appears that HITS and PageRank have been developed independently. In fact, both papers presenting PageRank and HITS are today citational blockbusters: the PageRank article collected 6167 citations, while the HITS paper has been cited 4617 times.

→ HITS는 Jon Kleinberg에 의해 제안된 Web page ranking 방법이다. HITS와 PageRank의 관계는 충격적이다. 비슷한 개념, 시간, 그리고 심지어 근접한 거리에도 불구하고 HITS와 PageRank는 독립적으로 개발되었다. 실제로 오늘 날 두 논문의 인용 수는 놀랍다. PageRank: 6167 인용, HITS: 4617이다.

💡

두 논문은 비슷한 시기에 발표되었지만, 독립적으로 개발되었고, 비슷한 개념으로 구현되었음. 추가적으로 두 논문의 인용 수 또한 매우 높은 것을 미루어 보아 웹 검색과 관련된 연구 분야에 큰 영향을 미친 것을 알 수 있음. 또한 웹 Topology만을 보는 것도 동일

HITS thinks of Web pages as authorities and hubs. HITS circular thesis reads as follows:

Good authorities are pages that are pointed to by good hubs and good hubs are pages that point to good authorities.

→ HITS는 웹페이지를 권위자(authority)과 저장소(hub)로 생각한다. HITS의 순환 이론은 다음에서 읽을 수 있다.

→ 좋은 권위자는 좋은 저장소에 의해 가르켜진 페이지이며 좋은 저장소는 좋은 권위자를 가르킨 페이지이다.

즉, 좋은 저장소는 좋은 권위자를 링크하고, 좋은 권위자는 좋은 저장소로부터 링크를 받는다.

💡

PageRank: 링크된 수, 링크 받은 수, 중요도 점수 → 중요도 점수 HITS: 권위자(받는페이지), 저장소(보내는페이지) → 중요도 점수 즉, Authorities와 Hubs는 상호 의존적인 관계 Hub는 네이버 같은?? 곳인가??

Let $L = (l_{i,j})$ be the adjacency matrix of the Web graph, i.e., $l_{i,j} = 1$ if page $i$ links to page $j$ and $l_{i,j} = 0$ otherwise. We denote with $L^T$ the transpose of $L$. HITS defines a pair of recursive equations as follows, where $x$ is the authority vector containing the authority scores and $y$ is the hub vector containing the hub scores:

Authority vector $x$	$x^{(k)} = L^T y^{(k-1)}$	(2)
Hub vector $y$	$y^{(k)} = L x^{(k)}$	ㅤ

→ Let $L = (l_{i,j})$ be the adjacency matrix of the Web graph: $l_{i,j} = 1$ if page $i$ points to page $j$, and $l_{i,j} = 0$ otherwise. $L^T$ is the transpose of $L$. HITS defines a pair of recursive equations over the authority vector $x$, which holds the authority scores, and the hub vector $y$, which holds the hub scores.

💡

The matrix $L$ has entry 1 where page $i$ points to page $j$ and 0 otherwise. $L^T$ is the transpose of $L$, in which the direction of every link is reversed. Unlike PageRank, HITS uses an authority score ($x$) and a hub score ($y$) instead of a single importance score, and computes them with recursive equations.

where $k \geq 1$ and $y^{(0)} = e$, the vector of all ones. The first equation tells us that authoritative pages are those pointed to by good hub pages, while the second equation claims that good hubs are pages that point to authoritative pages.

Notice that Equation 2 is equivalent to:

Authority vector $x$	$x^{(k)} = L^T L x^{(k-1)}$	(3)
Hub vector $y$	$y^{(k)} = L L^T y^{(k-1)}$	ㅤ

→ Here $k$ is at least 1 and $y^{(0)} = e$ is the all-ones vector. The first equation says that pages pointed to by good hubs are authoritative, and the second that pages pointing to authoritative pages are good hubs. Equations 2 and 3 are equivalent.

💡

첫 번째 방정식은 권위 점수를 계산하는 방식을 나타내며, 권위 있는 페이지는 많은 허브 페이지들로부터 링크를 받는 페이지. 두 번째 방정식은 허브 점수를 계산하며, 좋은 허브는 여러 권위 있는 페이지로 링크를 보내는 페이지.
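A quick numerical check of the equivalence between Equations 2 and 3 can be sketched on a tiny graph. The helper names `mat_vec`, `mat_mul`, `transpose` and the three-page adjacency matrix are assumptions for illustration. No normalization is applied here, so the raw scores simply grow from step to step.

```python
def mat_vec(M, v):
    """Matrix-vector product M v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(P, Q):
    """Matrix product P Q."""
    Qt = transpose(Q)
    return [[sum(a * b for a, b in zip(row, col)) for col in Qt] for row in P]


# Hypothetical adjacency matrix: 0 -> 1, 0 -> 2, 1 -> 2.
L = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
LT = transpose(L)
A = mat_mul(LT, L)   # authority matrix L^T L

y0 = [1, 1, 1]       # y^(0) = e
x1 = mat_vec(LT, y0)  # Equation 2: x^(1) = L^T y^(0)
y1 = mat_vec(L, x1)   # Equation 2: y^(1) = L x^(1)
x2 = mat_vec(LT, y1)  # Equation 2: x^(2) = L^T y^(1)
print(x1, y1, x2, mat_vec(A, x1))
```

The last two printed vectors coincide, i.e. $x^{(2)} = L^T L\, x^{(1)}$, as Equation 3 states.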

It follows that the authority vector $x$ is the dominant right eigenvector of the authority matrix $A = L^T L$, and the hub vector $y$ is the dominant right eigenvector of the hub matrix $H = L L^T$. This is very similar to the PageRank method, except the use of the authority and hub matrices instead of the Google matrix.

→ The authority vector $x$ and hub vector $y$ are dominant right eigenvectors of $A = L^T L$ and $H = L L^T$, which take the place of the Google matrix $G$.

💡

This explains how the authority vector $x$ and hub vector $y$ are computed in HITS. The authority vector $x$ is an eigenvector of the authority matrix $A$, the product $L^T L$ of the transposed adjacency matrix and the adjacency matrix $L$. The hub vector $y$ is an eigenvector of the hub matrix $H = L L^T$. In both cases it is the eigenvector associated with the largest eigenvalue. HITS and PageRank both compute importance from the link structure among Web pages. The main difference is that HITS uses the authority matrix $A$ and hub matrix $H$, whereas PageRank uses the transition probability matrix called the Google matrix $G$. PageRank assigns each page a single importance score, while HITS evaluates each page from two perspectives, authority and hub.

To compute the dominant eigenpair (eigenvector and eigenvalue) of the authority matrix we can again exploit the power method as follows: let

A

. Repeatedly compute

H

) and normalize

T_1

, where

T_2

is the signed component of maximal magnitude, until the desired precision is achieved. It follows that

i

converges to the dominant eigenvector

j

(the authority vector) and

i

converges to the dominant eigenvalue (the spectral radius, which is not necessarily 1). The hub vector

T_1

is then given by

j

. While the convergence of the power method is guaranteed, the computed solution is not necessarily unique, since the authority and hub matrices are not necessarily irreducible. A modification similar to the teleportation trick used for the PageRank method can be applied to HITS to recover the uniqueness of the solution.

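→ To compute the dominant eigenpair (eigenvector and eigenvalue) of the authority matrix $A$, we can again use the power method: let $x^{(0)} = e$. Repeatedly compute $x^{(k)} = Ax^{(k-1)}$ and normalize $x^{(k)} = x^{(k)}/m(x^{(k)})$, where $m(x^{(k)})$ is the signed component of $x^{(k)}$ with the largest magnitude, until the desired precision is reached.

Then $x^{(k)}$ converges to the dominant eigenvector $x$ (the authority vector) and $m(x^{(k)})$ converges to the dominant eigenvalue (the spectral radius, not necessarily 1). The hub vector $y$ is then given by $y = Lx$.

Convergence of the power method is guaranteed, but the computed solution is not necessarily unique, because the authority and hub matrices are not necessarily irreducible.

A modification similar to the teleportation trick used for PageRank can be applied to HITS to recover the uniqueness of the solution.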

💡

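This describes how the power method repeatedly computes and normalizes the eigenvector. $x^{(0)} = e$ is the initial vector of all ones. The normalizer $m(x^{(k)})$ is the component of $x^{(k)}$ with the largest absolute value, so after normalization the largest component of the vector is 1. The iterate $x^{(k)}$ converges to the authority vector $x$, and $m(x^{(k)})$ converges to the dominant eigenvalue of the authority matrix (cf. the eigenvalue need not be 1).

The hub vector $y$ is then computed from the authority vector $x$.

It is the product of the adjacency matrix $L$ and the authority vector $x$:

→ $y = Lx$

The power method converges, but because the authority and hub matrices may be reducible, the computed solution is not always unique. A technique similar to PageRank's teleportation (adding a probability of jumping to an arbitrary page) can be applied to HITS to make the solution unique.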
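The power-method steps for HITS can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `hits`, the tolerance, and the three-page example graph are assumptions made here, not part of the reviewed paper.

```python
def hits(L, tol=1e-10, max_iter=1000):
    """Power method on the authority matrix A = L^T L.

    L is a 0/1 adjacency matrix given as a list of rows:
    L[i][j] == 1 if page i links to page j.
    Returns (authority vector x, hub vector y, dominant eigenvalue estimate).
    """
    n = len(L)
    x = [1.0] * n  # x^(0) = e, the vector of all ones
    m = 0.0
    for _ in range(max_iter):
        # y = L x, then x_new = L^T y, so x_new = (L^T L) x
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        x_new = [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)]
        m = max(x_new, key=abs)  # signed component of maximal magnitude
        if m == 0:
            return x_new, y, 0.0  # graph without links: nothing to rank
        x_new = [value / m for value in x_new]
        converged = max(abs(a - b) for a, b in zip(x_new, x)) < tol
        x = x_new
        if converged:
            break
    y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]  # y = L x
    return x, y, m


# Hypothetical graph: pages 0 and 1 both link to page 2.
L = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority, hub, eigenvalue = hits(L)
print(authority)   # [0.0, 0.0, 1.0]
print(hub)         # [1.0, 1.0, 0.0]
print(eigenvalue)  # 2.0
```

In this toy graph page 2 is the only authority and pages 0 and 1 are the only hubs; the dominant eigenvalue is 2, not 1, which matches the remark that the spectral radius of the authority matrix is not necessarily 1.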

Figure 2: A HITS instance with solution (compare with PageRank scores in Figure 1). Each node is labelled with its authority (top) and hub (bottom) scores. Scores have been normalized to sum to 100. The dominant eigenvalue for both authority and hub matrices is 10.7.

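→ A HITS instance with its solution (compare with the PageRank scores in Figure 1). Each node is labelled with its authority (top) and hub (bottom) scores. Scores have been normalized to sum to 100. The dominant eigenvalue of both the authority and hub matrices is 10.7.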

💡

This caption describes the figure showing the result of the HITS algorithm. Each node (Web page) carries two scores: an authority score and a hub score.

An example of HITS is given in Figure 2. We stress the difference between importance, as computed by PageRank, and authority and hubness, as computed by HITS. Page B is both important and authoritative, but it is not a good hub. Page C is important but by no means authoritative. Pages G, H, I are neither important nor authoritative, but they are the best hubs of the network, since they point to good authorities only. Notice that the hub score of B is 0 although B has one outgoing edge; unfortunately for B, the only page C linked by B has no authority. Similarly, C has no authority because it is pointed to only by B, whose hub score is zero. This shows the difference between indegree and authority, as well as between outdegree and hubness. Finally, we observe that nodes with null authority scores (respectively, null hub scores) correspond to isolated nodes in the graph whose adjacency matrix is the authority matrix $A$ (respectively, the hub matrix $H$).

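→ An example of HITS is given in Figure 2. It highlights the difference between importance, as computed by PageRank, and authority and hubness, as computed by HITS.

Page B is both important and authoritative, but it is not a good hub. Page C is important but by no means authoritative. Pages G, H, and I are neither important nor authoritative, but they are the best hubs of the network because they point only to good authorities. The hub score of B is 0 even though B has one outgoing link; unfortunately for B, the only page it links to, C, has no authority.

Similarly, C has no authority because it is pointed to only by B, whose hub score is 0.

This shows the difference between indegree and authority, and between outdegree and hubness.

Finally, nodes with a null authority score (or null hub score) correspond to isolated nodes in the graph whose adjacency matrix is the authority matrix $A$ (or the hub matrix $H$).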

💡

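This explains the basic difference between what PageRank and HITS compute. PageRank computes the overall importance of a page; HITS computes separate authority and hub scores. → Nodes with a null authority (or hub) score are isolated nodes in the graph of the authority (or hub) matrix.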

An advantage of HITS with respect to PageRank is that it provides two scores at the price of one. The user is hence provided with two rankings: the most authoritative pages about the research topic, which can be exploited to investigate in depth a research subject, and the most hubby pages, which correspond to portal pages linking to the research topic from which a broad search can be started. A disadvantage of HITS is the higher susceptibility of the method to spamming: while it is difficult to add incoming links to our favourite page, the addition of outgoing links is much easier. This leads to the possibility of purposely inflating the hub score of a page, indirectly influencing also the authority scores of the pointed pages.

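→ An advantage of HITS over PageRank is that it provides two scores at the price of one. The user gets two rankings: the most authoritative pages on the research topic, which can be used to investigate a subject in depth, and the most hubby pages, which are portal pages linking to the research topic and from which a broad search can be started.

A disadvantage of HITS is that the method is more susceptible to spamming: it is difficult to add incoming links to our favourite page, but adding outgoing links is much easier. This makes it possible to purposely inflate the hub score of a page, indirectly influencing the authority scores of the pages it points to.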

💡

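Unlike PageRank, HITS provides two scores (authority and hub) → authoritative pages: provide in-depth information on a specific topic → hub pages: provide links to many related pages, which makes broad exploration easier. Disadvantage: a page's hub score can be artificially raised by deliberately adding many outgoing links, which in turn affects authority scores → vulnerable to spam and manipulation.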

HITS is related to a matrix factorization technique known as singular value decomposition (SVD). According to this technique, the adjacency matrix $L$ can be written as the matrix product $USV^T$, where the columns of $U$, called left-singular vectors, are the orthonormal eigenvectors of the hub matrix $LL^T$, the columns of $V$, called right-singular vectors, are the orthonormal eigenvectors of the authority matrix $L^TL$, and $S$ is a diagonal matrix whose diagonal elements, called singular values, correspond to the square roots of the eigenvalues of the hub matrix (or, equivalently, of the authority matrix). It follows that the HITS authority and hub vectors correspond, respectively, to the right- and left-singular vectors associated with the highest singular value of $L$.

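→ HITS is related to a matrix factorization technique known as singular value decomposition (SVD).

According to the SVD, the adjacency matrix $L$ can be written as the product $USV^T$, where the columns of $U$ (the left-singular vectors) are the orthonormal eigenvectors of the hub matrix $LL^T$, the columns of $V$ (the right-singular vectors) are the orthonormal eigenvectors of the authority matrix $L^TL$, and $S$ is a diagonal matrix whose diagonal elements (the singular values) are the square roots of the eigenvalues of the hub matrix (or, equivalently, of the authority matrix).

Hence the HITS authority vector $x$ and hub vector $y$ correspond, respectively, to the right- and left-singular vectors associated with the highest singular value of $L$.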

💡

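The HITS algorithm is related to singular value decomposition, a technique that factors a matrix into specific components. The adjacency matrix $L$ can be factored into the product of three matrices, $U$, $S$, and $V^T$:

- the columns of $U$ are the orthonormal eigenvectors of the hub matrix $LL^T$
- the columns of $V$ are the orthonormal eigenvectors of the authority matrix $L^TL$
- $S$ is a diagonal matrix with the singular values on its diagonal → the singular values are the square roots of the eigenvalues of the authority matrix $A$ and of the hub matrix $H$

The right- and left-singular vectors associated with the highest singular value of $L$ are the authority and hub vectors:

- authority vector $x$: right-singular vector of $L$
- hub vector $y$: left-singular vector of $L$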
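The link between the SVD and the two HITS matrices can be checked in two lines. The derivation below assumes only what the definition of the SVD provides: $L = USV^T$ with $U^TU = I$, $V^TV = I$, and $S$ diagonal.

```latex
\begin{aligned}
L^T L &= (U S V^T)^T (U S V^T) = V S\, U^T U\, S V^T = V S^2 V^T, \\
L L^T &= (U S V^T)(U S V^T)^T = U S\, V^T V\, S U^T = U S^2 U^T .
\end{aligned}
```

So each column of $V$ is an eigenvector of the authority matrix $L^TL$ and each column of $U$ is an eigenvector of the hub matrix $LL^T$, in both cases with eigenvalues equal to the squared singular values.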

HITS also has a connection to bibliometrics. Two typical bibliometric methods to identify similar publications are co-citation, in which publications are related when they are cited by the same papers, and co-reference, in which papers are related when they cite the same papers. The authority matrix is a co-citation matrix and the hub matrix is a co-reference matrix. Indeed, since $A = L^TL$, the element $a_{i,j}$ of the authority matrix contains the number of times pages $i$ and $j$ are both linked by a third page ($a_{i,i}$ is the number of inlinks of $i$). Hence, good authorities are pages that are frequently co-cited with other good authorities.

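→ HITS also has a connection to bibliometrics.

Two typical bibliometric methods for identifying similar publications are:

1) co-citation, in which publications are related when they are cited by the same papers;

2) co-reference, in which papers are related when they cite the same papers.

The authority matrix $A$ is a co-citation matrix,

and the hub matrix $H$ is a co-reference matrix.

Indeed, since $A = L^TL$, the element $a_{i,j}$ of the authority matrix contains the number of times pages $i$ and $j$ are both linked by a third page ($a_{i,i}$ is the number of inlinks of $i$). Good authorities are pages that are frequently co-cited with other good authorities.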

💡

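HITS is related to bibliometrics, the field that quantitatively analyzes the characteristics and patterns of scholarly literature. Two typical bibliometric methods for identifying similar publications: 1) co-citation: different publications are cited by the same paper → authority matrix $A = L^TL$ 2) co-reference: different papers cite the same publications → hub matrix $H = LL^T$. The element $a_{i,j}$ counts how often two pages are co-cited by other pages → a good authority is a page that is frequently co-cited with other important pages.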
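A small sketch makes the co-citation and co-reference reading of the two matrices concrete. The four-page graph and the helper names `transpose` and `matmul` are hypothetical, chosen only for illustration.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3.
L = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

A = matmul(transpose(L), L)  # authority matrix = co-citation counts
H = matmul(L, transpose(L))  # hub matrix = co-reference counts

print(A[2][3])  # 2: pages 2 and 3 are co-cited by two pages (0 and 1)
print(A[2][2])  # 2: diagonal entry = number of inlinks of page 2
print(H[0][1])  # 2: pages 0 and 1 cite two pages in common (2 and 3)
print(H[0][0])  # 2: diagonal entry = number of outlinks of page 0
```

The diagonal of $A$ recovers indegrees and the diagonal of $H$ recovers outdegrees, which is why authority and hubness can be seen as refinements of those two counts.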

A subsequent algorithm that incorporates ideas from both PageRank and HITS is SALSA: like HITS, SALSA computes both authority and hub scores, and like PageRank, these scores are obtained from Markov chains.

→ A later algorithm that combines ideas from both PageRank and HITS is SALSA: like HITS, it computes both authority and hub scores, and like PageRank, it obtains these scores from Markov chains.

💡

SALSA = PageRank (Markov chains) + HITS (authority and hub scores). That is, it computes authority and hub scores, and obtains them from Markov chains.

Bibliometrics ~ Conclusion

Bibliometrics

Bibliometrics, also known as scientometrics, is the quantitative study of the process of scholarly publication of research achievements. The most mundane aspect of this branch of information and library science is the design and application of bibliometric indicators to determine the influence of bibliometric units like scholars and academic journals. The Impact Factor is, undoubtedly, the most popular and controversial journal bibliometric indicator available at the moment. It is defined, for a given journal and a fixed year, as the mean number of citations in the year to papers published in the two previous years. It has been proposed in 1963 by Eugene Garfield, the founder of the Institute for Scientific Information (ISI), working together with Irv Sher. Journal Impact Factors are currently published in the popular Journal Citation Reports by Thomson-Reuters, the new owner of the ISI.

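→ Bibliometrics, also known as scientometrics, is the quantitative study of the process of scholarly publication of research achievements.

The most mundane aspect of this branch of information and library science is the design and application of bibliometric indicators to determine the influence of bibliometric units such as scholars and academic journals.

The Impact Factor (IF) is undoubtedly the most popular and most controversial journal bibliometric indicator currently available.

For a given journal and a fixed year (e.g. 2023), the IF is defined as the mean number of citations received in that year by papers published in the two previous years.

The IF was proposed in 1963 by Eugene Garfield, founder of the Institute for Scientific Information (ISI), working together with Irv Sher.

Journal IFs are currently published in the popular Journal Citation Reports by Thomson-Reuters, the new owner of the ISI.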

💡

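Bibliometrics: the discipline that quantifies and analyzes research output and the scholarly publication process. The IF is the best-known indicator of the influence of academic journals → it evaluates a journal's influence and importance, and is the subject of much controversy → it is the mean number of citations received in a given year by the papers the journal published in the two previous years. E.g., for 2023: if a journal published one paper in 2021 and one in 2022, and in 2023 they were cited 10 and 20 times respectively, then $(10 + 20)/2$, i.e. IF = 15. The IF is published in the Journal Citation Reports. Open question: why is the IF controversial?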
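As a quick check of the Impact Factor definition, here is a minimal sketch. The function name `impact_factor` and the two-paper journal are hypothetical; real Impact Factors are computed over all citable items of a journal.

```python
def impact_factor(citations_per_paper):
    """Impact Factor for one journal and one census year.

    citations_per_paper: citations received in the census year by each paper
    the journal published in the two previous years (one entry per paper).
    """
    if not citations_per_paper:
        raise ValueError("the journal published no papers in the two-year window")
    return sum(citations_per_paper) / len(citations_per_paper)


# Hypothetical journal: two papers (published in 2021 and 2022),
# cited 10 and 20 times in 2023.
print(impact_factor([10, 20]))  # 15.0
```

Every citation counts the same here, whichever journal it comes from; that is exactly the limitation the Pinski and Narin method addresses.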

The Impact Factor does not take into account the importance of the citing journals: citations from highly reputed journals are weighted as those from obscure journals. In 1976 Gabriel Pinski and Francis Narin developed an innovative journal ranking method [26]. The method measures the influence of a journal in terms of the influence of the citing journals. The Pinski and Narin thesis is:

A journal is influential if it is cited by other influential journals.

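→ The IF does not take into account the importance of the citing journals: citations from highly reputed journals are weighted the same as those from obscure journals. In 1976 Gabriel Pinski and Francis Narin developed an innovative journal ranking method. The method measures the influence of a journal in terms of the influence of the citing journals. The Pinski and Narin thesis is:

A journal is influential if it is cited by other influential journals.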

💡

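Problem: the IF does not consider the importance of the citing journals. Because it counts citations only, citations from highly reputed journals and from obscure journals are treated equally. Solution: in 1976 Gabriel Pinski and Francis Narin developed a new journal ranking method → a journal's influence is measured from the influence of the journals that cite it. That is, a journal frequently cited by influential journals is itself influential.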

This is the same circular thesis of the PageRank method. Given a source time window $T_1$ and a previous target time window $T_2$, the journal citation system can be viewed as a weighted directed graph in which nodes are journals and there is an edge from journal $i$ to journal $j$ if there is some article published in $i$ during $T_1$ that cites an article published in $j$ during $T_2$. The edge is weighted with the number $c_{i,j}$ of such citations from $i$ to $j$. Let $c_i = \Sigma_j c_{i,j}$ be the total number of cited references of journal $i$.

→ This is the same circular thesis as PageRank.

Given a source time window $T_1$ and a previous target time window $T_2$, the journal citation system can be viewed as a weighted directed graph in which nodes are journals and there is an edge from journal $i$ to journal $j$ if some article published in $i$ during $T_1$ cites an article published in $j$ during $T_2$.

The edge is weighted with the number $c_{i,j}$ of such citations from $i$ to $j$.

Let $c_i = \Sigma_j c_{i,j}$ be the total number of cited references of journal $i$ (that is, the number of references that journal $i$ gives out).

💡

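The journal citation system is described as a directed graph structured over time. Journals are nodes and citation relationships between journals are weighted edges → journal : node = citation relationship : edge → the edge weight is the number of citations between the two journals. $c_i$, the total number of references cited by journal $i$, is the sum over all journals $j$ of the citations $c_{i,j}$ from $i$ to $j$.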

In the method described by Pinski and Narin, a citation matrix $H = (h_{i,j})$ is constructed such that $h_{i,j} = c_{i,j}/c_j$. The coefficient $h_{i,j}$ is the amount of citations received by journal $j$ from journal $i$ per reference given out by journal $j$. For each journal an influence score is determined which measures the relative journal performance per given reference. The influence score $\pi_j$ of journal $j$ is defined as:

$\pi_j = \Sigma_i \pi_i \frac{c_{i,j}}{c_j} = \Sigma_i \pi_i h_{i,j}$

or, in matrix notation:

$\pi = \pi H$

(4)

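→ In the method described by Pinski and Narin, a citation matrix $H = (h_{i,j})$ is built with $h_{i,j} = c_{i,j}/c_j$.

The coefficient $h_{i,j}$ is the number of citations journal $j$ receives from journal $i$ per reference given out by journal $j$.

For each journal an influence score is determined, which measures the relative journal performance per given reference.

The influence score $\pi_j$ of journal $j$ is defined as: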

💡

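Citation matrix $H = (h_{i,j})$: represents the citation relationships between journals.

$h_{i,j}$: the citations journal $j$ receives from journal $i$, relative to the total references given out by journal $j$ → the number of citations $c_{i,j}$ from $i$ to $j$ divided by the total number of references $c_j$ given out by journal $j$. The coefficient $h_{i,j}$ expresses the relative weight of the citations journal $i$ gives to journal $j$. → If journal $j$ gives out many references (hands them out liberally), $h_{i,j}$ becomes smaller. → Conversely, if journal $j$ gives out few references, $h_{i,j}$ becomes larger. That is, $h_{i,j}$ is inversely proportional to the total number of references $c_j$ given out by journal $j$. An influence score is computed for each journal → it measures the journal's relative influence per reference given.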

Hence, journals $j$ with a large total influence $\pi_j c_j$ are those that receive significant endorsements from influential journals. Notice that the influence per reference score $\pi_j$ of a journal $j$ is a size independent measure, since the formula normalizes by the number of cited references $c_j$ contained in articles of the journal, which is an estimation of the size of the journal. Moreover, the normalization neutralizes the effect of journal self-citations, that is, citations between articles in the same journal. These citations are indeed counted both at the numerator and at the denominator of the influence score formula. This avoids over inflating journals that engage in the practice of opportunistic self-citations.

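→ Hence, journals $j$ with a large total influence $\pi_j c_j$ are those that receive significant endorsements from influential journals. ⇒ Why? If $c_j$ grows, doesn't $\pi_j$ get smaller? → Is it just a normalization? (The total influence $\pi_j c_j$ multiplies the per-reference score back by the journal size $c_j$.)

The influence-per-reference score $\pi_j$ of a journal $j$ is a size-independent measure, because the formula normalizes by the number of cited references $c_j$ contained in the journal's articles, which is an estimate of the size of the journal.

Moreover, the normalization neutralizes the effect of self-citations between articles in the same journal.

These citations are counted in both the numerator and the denominator of the influence score formula.

This avoids over-inflating journals that engage in opportunistic self-citation.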

💡

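A journal with a large total influence score is one that receives significant endorsements (citations) from other influential journals → a journal's influence is determined by the quality and quantity of the citations it receives. The influence score does not depend on journal size → because it is normalized by the number of cited references in the journal's articles. The normalization reduces the effect of journal self-citations (citations between the journal's own articles): self-citations appear in both the numerator (citations) and the denominator (total references) of the influence score, so their effect cancels out. This prevents a journal from artificially raising its rank by citing its own papers → similar to spam prevention.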

It can be proved that the spectral radius of matrix $H$ is 1, hence the influence score vector corresponds to the dominant eigenvector of $H$. In principle, the uniqueness of the solution and the convergence of the power method to it are not guaranteed. Nevertheless, both properties are not difficult to obtain in real cases. If the citation graph is strongly connected, then the solution is unique. When journals belong to the same research field, this condition is typically satisfied. Moreover, if there exists a self-loop in the graph, that is an article that cites an article in the same journal, then the power method converges.

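→ It can be proved that the spectral radius of the matrix $H$ is 1 ⇒ (why? because it is a stochastic matrix??), hence the influence score vector corresponds to the dominant eigenvector of $H$. In principle, neither the uniqueness of the solution nor the convergence of the power method is guaranteed ⇒ (why? because the matrix may be reducible??).

Nevertheless, both properties are not difficult to obtain in real cases. If the citation graph is strongly connected, the solution is unique. When journals belong to the same research field, this condition is typically satisfied. Moreover, if the graph contains a self-loop, that is, an article that cites an article in the same journal, then the power method converges. ⇒ (why???)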

💡

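The largest eigenvalue of $H$ is 1, so the influence scores are well defined as the dominant eigenvector. Uniqueness and convergence are not guaranteed in theory, but uniqueness holds when the real citation graph is strongly connected (every journal can reach every other one, i.e. the matrix is irreducible). → In particular, when a self-loop exists (for example, a journal cites another of its own papers), the power method converges to the solution. ⇒ Why? I thought the earlier convergence condition was that the matrix is stochastic. A self-loop in a strongly connected graph makes the matrix aperiodic, and for an irreducible matrix aperiodicity is what the power method needs to converge; a spectral radius of 1 alone does not ensure it. → This is a property commonly found in real journal networks.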
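The Pinski and Narin iteration $\pi = \pi H$ can be sketched with the power method. The citation counts below are made up for illustration (a strongly connected graph with one self-loop), and the function name `pinski_narin` is hypothetical; the sketch also assumes every journal gives out at least one reference, so that no $c_j$ is zero.

```python
def pinski_narin(C, tol=1e-12, max_iter=10_000):
    """Influence scores pi with pi = pi H, where h[i][j] = c[i][j] / c_j.

    C[i][j] is the number of citations from journal i to journal j.
    Returns (pi normalized to sum to 1, H).
    """
    n = len(C)
    c = [sum(row) for row in C]  # c_i: references given out by journal i
    H = [[C[i][j] / c[j] for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
        total = sum(new)
        new = [value / total for value in new]  # keep the scores summing to 1
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new, H
        pi = new
    raise RuntimeError("power method did not converge")


# Hypothetical citation counts among three journals (journal 0 cites itself).
C = [
    [2, 3, 1],
    [4, 0, 2],
    [1, 3, 0],
]
pi, H = pinski_narin(C)
print([round(100 * p, 1) for p in pi])  # influence scores, scaled to sum to 100
```

Multiplying each $\pi_j$ by $c_j$ would give the total influence of the journal rather than its influence per reference.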

Figure 3:An instance with solution of the journal ranking method proposed by Pinski and Narin. Nodes are labelled with influence scores and edges with the citation flow between journals. Scores have been normalized to sum to 100.

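→ Figure 3: An instance, with solution, of the journal ranking method proposed by Pinski and Narin. Nodes are labelled with influence scores and edges with the citation flow between journals. Scores have been normalized to sum to 100.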

💡

Score inside a node: influence score. Edge: citation flow between journals.

Figure 3 provides an example of the Pinski and Narin method. Notice that the graph is strongly connected and has a self-loop, hence the solution is unique and can be computed with the power method. Both journals A and C receive the same number of citations and give out the same number of references. Nevertheless, the influence of A is bigger, since it is cited by a more influential journal (B instead of D). Furthermore, A and D receive the same number of citations from the same journals, but D is larger than A, since it contains more references, hence the influence of A is higher.

Similar recursive methods have been independently proposed by [19] and [23] in the context of ranking of economics journals. Recently, various PageRank-inspired bibliometric indicators to evaluate the importance of journals using the academic citation network have been proposed and extensively tested: journal PageRank [3], Eigenfactor [34], and SCImago [28].

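→ Figure 3 is an example of the Pinski and Narin method.

The graph is strongly connected (== irreducible → the solution is unique) and has a self-loop (⇒ why does a self-loop make the power method applicable? Because the computation becomes recursive? If so, why does a self-loop make it recursive?), hence the solution is unique and can be computed with the power method.

Journals A and C receive the same number of citations and give out the same number of references. Nevertheless, the influence of A is bigger, because A is cited by a more influential journal (B instead of D). Furthermore, A and D receive the same number of citations from the same journals, but D contains more references than A, hence the influence of A is higher.

Similar recursive methods were independently proposed by [19] and [23] for ranking economics journals. Recently, various PageRank-inspired bibliometric indicators that evaluate the importance of journals using the academic citation network have been proposed and extensively tested: journal PageRank [3], Eigenfactor [34], and SCImago [28].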

💡

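Citation relationships between journals are a key factor in assessing influence: what matters is not only how many citations a journal receives but also who cites it. Similar methods are also used to evaluate economics journals.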

Sociometry

Sociometry, the quantitative study of social relationships, contains remarkably old PageRank predecessors. Sociologists were the first to use the network approach to investigate the properties of groups of people related in some way. They devised measures like indegree, closeness, betweenness, as well as eigenvector centrality which are still used today in modern (not necessarily social) network analysis. In particular, eigenvector centrality uses the same central ingredient of PageRank applied to a social network:

A person is prestigious if he is endorsed by prestigious people.

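→ Sociometry, the quantitative study of social relationships, contains remarkably old predecessors of PageRank.

Sociologists were the first to use the network approach to investigate the properties of groups of people related in some way.

They devised measures such as indegree, closeness, betweenness, and eigenvector centrality, which are still used today in modern (not necessarily social) network analysis.

In particular, eigenvector centrality uses the same central ingredient as PageRank, applied to a social network:

A person is prestigious if he is endorsed by prestigious people.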

💡

Eigenvector centrality, used in social network analysis, relies on a principle similar to PageRank. It determines the importance of a person within a social network, expressing how much that person is endorsed by the other important people he or she is connected to.

John R. Seeley in 1949 is probably the first in this context to use the circular argument of PageRank. Seeley reasons in terms of social relationships among children: each child chooses other children in a social group with a nonnegative strength. The author notices that the total choice strength received by each child is inadequate as an index of popularity, since it does not consider the popularity of the chooser. Hence, he proposes to define the popularity of a child as a function of the popularity of those children who chose the child, and the popularity of the choosers as a function of the popularity of those who chose them, and so on in an “indefinitely repeated reflection”. Seeley exposes the problem in terms of linear equations and uses Cramer’s rule to solve the linear system. He does not discuss the issue of uniqueness.

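→ In 1949 John R. Seeley was probably the first in this context to use the circular argument of PageRank.

Seeley reasons in terms of social relationships among children:

each child chooses other children in a social group with a nonnegative strength. The author notes that the total choice strength received by each child is inadequate as an index of popularity,

because it does not consider the popularity of the chooser.

Hence, he proposes to define the popularity of a child as a function of the popularity of the children who chose that child, and the popularity of the choosers as a function of the popularity of those who chose them, and so on in an "indefinitely repeated reflection". Seeley states the problem in terms of linear equations and uses Cramer's rule to solve the linear system. He does not discuss the issue of uniqueness.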

💡

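Seeley's study uses circular reasoning to assess popularity among children. The idea is that a child's popularity is determined not simply by the number of children who chose him or her, but also by the popularity of those choosers → similar to how PageRank assesses the importance of Web pages.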

Another model was proposed in 1953 by Leo Katz. Katz views a social network as a directed graph where nodes are people and person $i$ is connected by an edge to person $j$ if $i$ chooses, or endorses, $j$. The status of member $i$ is defined as the number of weighted paths reaching $i$ in the network, a generalization of the indegree measure. Long paths are weighted less than short ones, since endorsements devalue over long chains. Notice that this method indirectly takes account of who endorses as well as how many endorse an individual: if a node $i$ points to a node $j$ and $i$ is reached by many paths, then the paths leading to $i$ arrive also at $j$ in one additional step.

→ Another model was proposed in 1953 by Leo Katz. Katz views a social network as a directed graph where nodes are people and there is an edge from person $i$ to person $j$ if $i$ chooses, or endorses, $j$.

The status of member $i$ is defined as the number of weighted paths reaching $i$ in the network. This is a generalization of the indegree measure. Long paths are weighted less than short ones,

reflecting that endorsements lose value over long chains.

This method indirectly takes account of who endorses an individual as well as how many do: if a node $i$ points to a node $j$ and $i$ is reached by many paths, then the paths leading to $i$ also arrive at $j$ in one additional step.

💡

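- Katz's model considers path lengths and the weight of each path when assessing the status or influence of an individual in a social network - What matters is not only being endorsed by many people but also the position and influence of the endorsers within the network. For example, an endorsement from a person who is himself reached by many paths carries more weight.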

Katz builds an adjacency matrix $L = (l_{i,j})$ such that $l_{i,j} = 1$ if person $i$ chooses person $j$ and $l_{i,j} = 0$ otherwise. He defines a matrix $W = \Sigma_{k=1}^\infin (aL)^k$, where $a$ is an attenuation constant. Notice that the $(i,j)$ component of $L^k$ is the number of paths of length $k$ from $i$ to $j$, and this number is attenuated by $a^k$ in the computation of $W$. Hence, the $(i,j)$ component of the limit matrix $W$ is the weighted number of arbitrary paths from $i$ to $j$. Finally, the status of member $j$ is $\pi_j = \Sigma_i w_{i,j}$, that is, the number of weighted paths reaching $j$. If the attenuation factor $a < 1/\rho(L)$, with $\rho(L)$ the spectral radius of $L$, then the above series for $W$ converges.

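→ Katz builds an adjacency matrix $L = (l_{i,j})$ such that $l_{i,j} = 1$ if person $i$ chooses person $j$ and $l_{i,j} = 0$ otherwise.

He defines a matrix $W = \Sigma_{k=1}^\infin (aL)^k$, where $a$ is an attenuation constant. The $(i,j)$ component of $L^k$ is the number of paths of length $k$ from $i$ to $j$, and this number is attenuated by $a^k$ in the computation of $W$.

Hence, the $(i,j)$ component of the limit matrix $W$ is the weighted number of arbitrary paths from $i$ to $j$.

Finally, the status of member $j$ is $\pi_j = \Sigma_i w_{i,j}$, that is, the number of weighted paths reaching $j$.

If the attenuation factor satisfies $a < 1/\rho(L)$, where $\rho(L)$ is the spectral radius of $L$, the above series for $W$ converges.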

💡

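Katz's model represents the relationships between people with a matrix and computes each member's social status from the strength of these relationships and the path lengths. → The influence each person has on others is quantified through path length and the attenuation factor → long paths are weighted less by the attenuation factor, so distant relationships count less. → The choice of the attenuation factor $a$ determines how far the propagation of influence through the network is taken into account.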
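The Katz status $\pi_j = \Sigma_i w_{i,j}$ can be computed without forming $W$ explicitly, by summing the row vectors $e(aL)^k$ term by term. The sketch below assumes a hypothetical three-person chain and a hypothetical function name `katz`; it simply stops when the next term is negligible, which only happens when $a < 1/\rho(L)$.

```python
def katz(L, a, tol=1e-12, max_iter=10_000):
    """Katz status: pi = e * sum_{k>=1} (a L)^k, summed term by term.

    L[i][j] == 1 if person i chooses person j; a is the attenuation constant.
    """
    n = len(L)
    term = [1.0] * n  # e (aL)^0 = e, the vector of all ones
    pi = [0.0] * n
    for _ in range(max_iter):
        # next term: previous term times (a L), as a row vector
        term = [a * sum(term[i] * L[i][j] for i in range(n)) for j in range(n)]
        pi = [p + t for p, t in zip(pi, term)]
        if max(abs(t) for t in term) < tol:
            return pi
    raise RuntimeError("series did not converge; is a < 1/rho(L)?")


# Hypothetical chain of endorsements: 0 -> 1 -> 2.
L = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(katz(L, 0.5))  # [0.0, 0.5, 0.75]
```

Person 2 is reached by one path of length 1 (weight 0.5) and one path of length 2 (weight 0.25), hence the status 0.75; person 1 is reached only by a single path of length 1.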

Figure 4: An example of the Katz model using two attenuation factors: $a = 0.9$ and $a = 0.1$ (the spectral radius of the adjacency matrix $L$ is 1). Each node is labelled with the Katz score corresponding to $a = 0.9$ (top) and $a = 0.1$ (bottom). Scores have been normalized to sum to 100.

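→ An example of the Katz model using two attenuation factors: $a = 0.9$ and $a = 0.1$ (the spectral radius of the adjacency matrix is $\rho(L) = 1$). Each node is labelled with its Katz scores: top $a = 0.9$, bottom $a = 0.1$. Scores have been normalized to sum to 100.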

💡

Figure 4 illustrates the method with an example. Notice the important role of the attenuation factor: when it is large (close to $1/\rho(L)$), long paths are devalued smoothly, and Katz scores are strongly correlated with PageRank ones. In the shown example, PageRank and Katz methods provide the same ranking of nodes when the attenuation factor is 0.9. On the other hand, if the attenuation factor is small (close to 0), then the contribution given by paths longer than 1 rapidly declines, and thus Katz scores converge to indegrees, the number of incoming links of nodes. In the example, when the attenuation factor drops to 0.1, nodes C and E switch their positions in the ranking: node E, which receives many short paths, significantly increases its score, while node C, which is the destination of just one short path and many (devalued) long ones, significantly decreases its score.

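→ Figure 4 illustrates the method with an example.

Note the important role of the attenuation factor:

1) When the attenuation factor is large (close to $1/\rho(L)$), long paths are devalued smoothly, and Katz scores are strongly correlated with PageRank scores.

In the example shown, the PageRank (0.85) and Katz methods give the same ranking of nodes when the attenuation factor is 0.9. —>???

2) On the other hand, when the attenuation factor is small (close to 0), the contribution of paths longer than 1 declines rapidly, and Katz scores converge to indegrees (the number of incoming links of nodes).

In the example, when the attenuation factor drops to 0.1, nodes C and E switch positions in the ranking: node E, which receives many short paths, significantly increases its score, while node C, which is the destination of just one short path and many (devalued) long ones, significantly decreases its score.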

💡

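In Katz's model the attenuation factor determines the weight given to each path according to its length. 1) With a large attenuation factor, long paths still receive relatively high weight and Katz scores resemble PageRank scores. 2) With a small attenuation factor, only short paths receive high weight and Katz scores resemble node indegrees.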

In 1965 Charles H. Hubbell generalizes the proposal of Katz. Given a set of members of a social context, Hubbell defines a matrix $W = (w_{i,j})$ such that $w_{i,j}$ is the strength at which $i$ endorses $j$. Interestingly, these weights can be arbitrary, and in particular, they can be negative. The prestige of a member is recursively defined in terms of the prestige of the endorsers and takes account of the endorsement strengths:

$\pi = \pi W + v$

(5)

The term $v$ is an exogenous vector such that $v_i$ is a minimal amount of status assigned to $i$ from outside the system.

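→ In 1965 Charles H. Hubbell generalized Katz's proposal.

Given a set of members of a social context, Hubbell defines a matrix $W = (w_{i,j})$, where $w_{i,j}$ is the strength with which $i$ endorses $j$.

Interestingly, these weights can be arbitrary (any real value) and, in particular, negative.

The prestige of a member is defined recursively in terms of the prestige of the endorsers, taking the endorsement strengths into account:

Here $v$ is an exogenous vector, and $v_i$ is the minimal amount of status assigned to $i$ from outside the system.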

💡

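Hubbell's model computes the prestige of individuals within a social network → a richer approach that considers both the interactions between members and exogenous factors. Each member's prestige depends on the strength of the endorsements received and on the prestige of the endorsers. cf. Weights can be negative → a member can also have a negative effect on another member. cf. The exogenous term $v$ → each member has a minimal amount of status $v_i$.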

The original aspects of the method are the presence of an exogenous initial input and the possibility of giving negative endorsements. A consequence of negative endorsements is that the status of an actor can also be negative. An actor that receives a positive (respectively, negative) judgment from a member of positive status increases (respectively, decreases) his prestige.

On the other hand, and interestingly, receiving a positive judgment from a member of negative status makes a negative contribution to the prestige of the endorsed member (if you are endorsed by some person affiliated to the Mafia your reputation might drop indeed).

Moreover, receiving a negative endorsement from a member of negative status makes a positive contribution to the prestige of the endorsed person (if the same Mafioso opposes you, then your reputation might rise).

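→ The original aspects of the method are the presence of an exogenous initial input and the possibility of giving negative endorsements.

As a consequence of negative endorsements, the status of an actor can also be negative. An actor who receives a positive (respectively, negative) judgment from a member of positive status sees his prestige increase (respectively, decrease).

On the other hand, and interestingly, receiving a positive judgment from a member of negative status makes a negative contribution to the prestige of the endorsed member (for example, being endorsed by someone affiliated with the Mafia may indeed lower your reputation).

Moreover, receiving a negative endorsement from a member of negative status makes a positive contribution to the prestige of the endorsed person (if the same Mafioso opposes you, your reputation might rise).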

💡

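Social prestige is not determined simply by the amount of endorsement: the quality of the endorsement and the prestige of the endorser also matter. Member of positive status → endorsement → prestige increases. Member of negative status → endorsement → prestige decreases. An individual's prestige can change according to the prestige of the people associated with him, not just according to positive perception.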

Figure 5:An instance of the Hubbell model with solution: each node is labelled with its prestige score and each edge is labelled with the endorsement strength between the connected members; negative strength is highlighted with dashed edges. The minimal amount of status has been fixed to 0.2 for all members.

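→ An instance of the Hubbell model with solution: each node is labelled with its prestige score and each edge with the endorsement strength between the connected members; negative strengths are highlighted with dashed edges. The minimal amount of status is fixed to 0.2 for all members.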

💡

Figure 5 shows an example for the Hubbell model. Notice that Charles does not receive any endorsement and hence has the minimal amount of status given by default to each member. David receives only negative judgments; interestingly, the fact that he has a positive self opinion further decreases his status. A better strategy for him, knowing in advance of his negative status, would be to negatively judge himself, acknowledging the negative judgment given by the other members.

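→ Figure 5 shows an example of the Hubbell model. Charles receives no endorsement and therefore has the minimal amount of status (0.2) given by default to each member. David receives only negative judgments.

Interestingly, having a positive opinion of himself further decreases his status. A better strategy for David, if he knows in advance about his negative status, would be to judge himself negatively, acknowledging the negative judgments given by the other members.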

💡

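This illustrates with an example how an individual's status is determined in the Hubbell model. Charles receives no endorsement in the network and so keeps the minimal status assigned by default. David receives negative judgments from the other members → his positive self-assessment actually lowers his status.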

Equation 5 is equivalent to $\pi(I-W)=v$, where $I$ is the identity matrix, that is $\pi = v(I-W)^{-1} = v \Sigma_{i=0}^{\infin} W^i$. The series converges if and only if the spectral radius of $W$ is less than 1. It is now clear that the Hubbell model is a generalization of the Katz model to general matrices that adds an initial exogenous input $v$. Indeed, the Katz equation for social status is $\pi = e \Sigma_{i=1}^{\infin} (aL)^i$, where $e$ is a vector of all ones. In an unpublished note Vigna traces the history of the mathematics of spectral ranking and shows that there is a reduction from the path summation formulation of Hubbell-Katz to the eigenvector formulation with teleportation of PageRank and vice versa. In the mapping the attenuation constant is the counterpart of the PageRank damping factor, and the exogenous vector corresponds to the PageRank personalization vector. The interpretation of PageRank as a sum of weighted paths is also investigated in [2].

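→ Equation 5 is equivalent to $\pi(I-W)=v$, where $I$ is the identity matrix.

That is, $\pi = v(I-W)^{-1} = v \Sigma_{i=0}^{\infin} W^i$. The series converges if and only if the spectral radius of $W$ is less than 1.

The Hubbell model is a generalization of the Katz model to general matrices that adds an initial exogenous input $v$.

Indeed, Katz's equation for social status is $\pi = e \Sigma_{i=1}^{\infin} (aL)^i$, where $e$ is the vector of all ones.

In an unpublished note, Vigna traces the history of the mathematics of spectral ranking and shows that the path-summation formulation of Hubbell-Katz and the eigenvector formulation with teleportation of PageRank can be reduced to each other.

In this mapping the attenuation constant is the counterpart of the PageRank damping factor, and the exogenous vector corresponds to the PageRank personalization vector. The interpretation of PageRank as a sum of weighted paths is also investigated in [2].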

💡

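The closed form $\pi = v(I-W)^{-1}$ can be derived from (5). The Hubbell model generalizes the Katz model by including an initial exogenous input $v$. Each member's status is driven by the exogenous vector $v$ and by the powers of the matrix $W$, so path lengths and strengths within the social network are taken into account. As with PageRank, the importance of each node is assessed from the interactions across the whole network.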
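Hubbell's equation $\pi = \pi W + v$ can be solved by fixed-point iteration, which is equivalent to summing the series $v \Sigma_{i \geq 0} W^i$ and therefore converges only when the spectral radius of $W$ is below 1. The two-member network, the weights, and the function name `hubbell` below are hypothetical.

```python
def hubbell(W, v, tol=1e-12, max_iter=10_000):
    """Solve pi = pi W + v by fixed-point iteration, starting from pi = v.

    W[i][j] is the strength (possibly negative) with which i endorses j;
    v[j] is the minimal status assigned to j from outside the system.
    """
    n = len(W)
    pi = list(v)
    for _ in range(max_iter):
        new = [sum(pi[i] * W[i][j] for i in range(n)) + v[j] for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("iteration did not converge; is the spectral radius of W below 1?")


# Hypothetical network: member 0 endorses member 1 (+0.5),
# member 1 gives member 0 a negative endorsement (-0.5).
W = [
    [0.0, 0.5],
    [-0.5, 0.0],
]
v = [0.2, 0.2]
print([round(p, 6) for p in hubbell(W, v)])  # [0.08, 0.24]
```

Member 0 ends up below the default status of 0.2 because of the negative endorsement, while member 1 rises above it; solving the two linear equations by hand gives the same values, 0.08 and 0.24.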

Spectral ranking methods have also been exploited to rank sport teams in competitions that involve teams playing in pairs [32, 13]. The underlying idea is that a team is strong if it won against other strong teams. Much of the art of the sport ranking problem is how to define the matrix entries expressing how much one team is better than another (e.g., an entry could be 1 if the first team beats the second, 0.5 if the game ended in a tie, and 0 otherwise) [12].

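→ Spectral ranking methods have also been used to rank sport teams in competitions where teams play in pairs.

The underlying idea is that a team is strong if it has won against other strong teams. Much of the art of the sport ranking problem lies in how to define the matrix entries expressing how much one team is better than another.

(For example, an entry could be set to 1 if the first team beats the second, 0.5 if the game ends in a tie, and 0 otherwise.)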

💡

The results of games between sport teams can be written as a matrix, from which the strength of each team is computed. → The entry depends on the result. Win: 1. Tie: 0.5. Loss: 0.

Econometrics

We conclude with a succinct description of the input-output model developed in 1941 by Nobel Prize winner Wassily W. Leontief in the field of econometrics – the quantitative study of economic principles. According to the Leontief input-output model, the economy of a country may be divided into any desired number of sectors, called industries, each consisting of firms producing a similar product. Each industry requires certain inputs in order to produce a unit of its own product, and sells its products to other industries to meet their ingredient requirements. The aim is to find prices for the unit of product produced by each industry that guarantee the reproducibility of the economy, which holds when each sector balances the costs for its inputs with the revenues of its outputs. In 1973, Leontief earned the Nobel Prize in economics for his work on the input-output model. An example is provided in Table 2.

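→ In econometrics, the quantitative study of economic principles, the input-output model developed in 1941 by Nobel laureate Leontief divides a country's economy into any desired number of sectors, called industries, each consisting of firms that produce a similar product.

Each industry requires certain inputs to produce a unit of its own product, and sells its product to other industries to meet their input requirements.

The aim is to find unit prices for each industry's product that guarantee the reproducibility of the economy, which holds when each sector balances the cost of its inputs with the revenue from its outputs.

In 1973, Leontief received the Nobel Prize in economics for his work on the input-output model. An example is given in Table 2.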

💡

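The Leontief input-output model analyzes the interdependence of industries in order to understand how the economy works as a whole. It models mathematically how each industry receives the inputs it needs from other industries and supplies its own output back to them.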

ㅤ	agriculture	industry	family	total	price	revenue
agriculture	7.5	6	16.5	30	20	600
industry	14	6	30	50	15	750
family	80	180	40	300	3	900
cost	600	750	900	ㅤ	ㅤ	ㅤ

Table 2: An input-output table for an economy with three sectors with the balance solution. Each row shows the output of a sector to other sectors of the economy. Each column shows the inputs received by a sector from other sectors. For each sector we also show total quantity produced, equilibrium unitary price, total cost, and total revenue. Notice that each sector balances costs and revenues.

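→ Table 2 shows an input-output table for an economy with three sectors, together with the balance solution. Each row shows the output of one sector to the other sectors, and each column shows the inputs one sector receives from the other sectors. For each sector the table also gives the total quantity produced, the equilibrium unit price, the total cost, and the total revenue. Every sector balances its costs and revenues.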
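The balance claimed in the caption can be checked directly from the numbers in Table 2. The sketch below is only a verification aid; the variable names are chosen here. It computes each sector's revenue as price times total output (a row sum) and its cost as the price-weighted sum of the inputs in its column.

```python
sectors = ["agriculture", "industry", "family"]
# q[i][j]: quantity produced by sector i and used by sector j (Table 2).
q = [
    [7.5, 6.0, 16.5],
    [14.0, 6.0, 30.0],
    [80.0, 180.0, 40.0],
]
price = [20.0, 15.0, 3.0]  # equilibrium unit prices from Table 2

n = len(sectors)
total = [sum(row) for row in q]                     # row sums: 30, 50, 300
revenue = [price[i] * total[i] for i in range(n)]   # what each sector earns
cost = [sum(price[i] * q[i][j] for i in range(n))   # what each sector pays
        for j in range(n)]

for name, c, r in zip(sectors, cost, revenue):
    print(f"{name}: cost={c:g} revenue={r:g}")
```

Both lists come out as 600, 750, and 900, matching the "cost" row and the "revenue" column of the table.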

💡

Let $q_{i,j}$ denote the quantity produced by the $i$th industry and used by the $j$th industry, and $q_i$ be the total quantity produced by sector $i$, that is, $q_i = \sum_j q_{i,j}$. Let $A = (a_{i,j})$ be such that $a_{i,j} = q_{i,j} / q_j$; each coefficient $a_{i,j}$ represents the amount of product $i$ (produced by industry $i$) consumed by industry $j$ that is necessary to produce a unit of product $j$. Let $\pi_j$ be the price for the unit of product produced by each industry $j$. The reproducibility of the economy holds when each sector $j$ balances the costs for its inputs with the revenues of its outputs, that is:

$$\sum_i \pi_i q_{i,j} = \pi_j q_j$$

By dividing each balance equation by $q_j$ we have

$$\pi_j = \sum_i \pi_i a_{i,j}$$

or, in matrix notation,

$$\pi = \pi A \qquad (6)$$

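→ $q_{i,j}$ is the quantity produced by the $i$th industry and used by the $j$th industry.

$q_i$ is the total quantity produced by sector $i$ (the "total" column of Table 2).

That is, $q_i = \sum_j q_{i,j}$.

$A = (a_{i,j})$ is defined by $a_{i,j} = q_{i,j} / q_j$; each coefficient $a_{i,j}$ is the amount of product made by industry $i$ that is needed to produce one unit of product $j$.

$\pi_j$ is the unit price of the product made by industry $j$.

The economy is reproducible when each sector $j$ balances the cost of its inputs with the revenue from its outputs, that is, $\sum_i \pi_i q_{i,j} = \pi_j q_j$.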

💡

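$q_{i,j}$: quantity produced by the $i$th industry and used by the $j$th industry

$q_i$: total quantity produced by sector $i$

$a_{i,j}$: amount of industry $i$'s product needed to make product $j$ → (quantity produced by industry $i$ and used by industry $j$) / (total quantity produced by industry $j$) → the amount of industry $i$'s product consumed per unit of product $j$

$\pi_j$: price of one unit of the product made by industry $j$

Each sector receives the inputs it needs from the other sectors and supplies its own product to them. These relations determine each sector's total output and the quantity it supplies to every other sector.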
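As a concrete check of these definitions, the following sketch builds the coefficient matrix from Table 2 using $a_{i,j} = q_{i,j} / q_j$ and confirms that the tabulated prices satisfy the fixed-point equation $\pi = \pi A$. The variable names are chosen here for illustration.

```python
# Table 2: q[i][j] = quantity produced by sector i and used by sector j.
q = [
    [7.5, 6.0, 16.5],     # agriculture
    [14.0, 6.0, 30.0],    # industry
    [80.0, 180.0, 40.0],  # family
]
pi = [20.0, 15.0, 3.0]    # equilibrium unit prices from Table 2

n = len(q)
q_total = [sum(row) for row in q]  # q_i = sum_j q[i][j]

# a[i][j] = q[i][j] / q_j: product i consumed per unit of product j.
a = [[q[i][j] / q_total[j] for j in range(n)] for i in range(n)]

# Right-hand side of pi = pi A, one entry per sector j.
pi_a = [sum(pi[i] * a[i][j] for i in range(n)) for j in range(n)]
print(pi_a)
```

For example, one unit of agricultural product needs 7.5/30 = 0.25 units of agricultural product, 14/30 units of industrial product, and 80/30 units of family labour. At prices 20, 15, and 3 this costs 5 + 7 + 8 = 20, exactly the price of agricultural product.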

Hence, highly remunerated industries (industries $j$ with high total revenue $\pi_j q_j$) are those that receive substantial inputs from highly remunerated industries, a circularity that closely resembles the PageRank thesis. With the same argument used in [9] for the Pinski and Narin bibliometric model we can show that the spectral radius of matrix $A$ is 1, thus the equilibrium price vector $\pi$ is the dominant eigenvector of matrix $A$. Such a solution always exists, although it might not be unique, unless $A$ is irreducible. Notice the striking similarity of the Leontief closed model with that proposed by Pinski and Narin. An open Leontief model adds an exogenous demand and creates a surplus of revenue (profit). It is described by the equation $\pi = \pi A + v$ where $v$ is the profit vector. Hubbell himself observes the similarity between his model and the Leontief open model.

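→ Hence, highly remunerated industries (industries $j$ with high total revenue $\pi_j q_j$) are those that receive substantial inputs from highly remunerated industries.

This circularity closely resembles the PageRank thesis.

Using the same argument as in [9] for the Pinski and Narin bibliometric model, the spectral radius of matrix $A$ can be shown to be 1, so the equilibrium price vector $\pi$ is the dominant eigenvector of $A$.

Such a solution always exists, but it may not be unique unless $A$ is irreducible.

The closed model of Leontief (1941) is strikingly similar to the model proposed by Pinski and Narin (1976). The open Leontief model adds an exogenous demand and creates a surplus of revenue (profit).

It is described by the equation $\pi = \pi A + v$, where $v$ is the profit vector.

Hubbell (1965) himself noted the similarity between his model and the open Leontief model (1941).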

💡

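This passage explains how the Leontief model expresses the interdependence of the sectors of an economy. Industries with high revenue tend to receive substantial inputs (products) from other industries with high revenue → similar to the way PageRank evaluates the importance of Web pages.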
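One way to see the eigenvector claim numerically (a sketch, not necessarily the argument used in [9]) is to write $A = S P S^{-1}$, where $S$ is the diagonal matrix of totals $q_i$ and $P$ has entries $q_{i,j} / q_i$. Each row of $P$ sums to 1, so $P$ is row-stochastic, has spectral radius 1, and shares its eigenvalues with $A$. The revenue vector with entries $\pi_j q_j$ is then a left eigenvector of $P$ for eigenvalue 1. The code below runs a PageRank-style power iteration on $P$ built from Table 2 and recovers the prices up to scale; the variable names and the choice to fix the family price at 3 are assumptions made for the comparison.

```python
# Table 2: q[i][j] = quantity produced by sector i and used by sector j.
q = [
    [7.5, 6.0, 16.5],     # agriculture
    [14.0, 6.0, 30.0],    # industry
    [80.0, 180.0, 40.0],  # family
]
n = len(q)
q_total = [sum(row) for row in q]

# P = S^{-1} Q: row-stochastic because row i of Q sums to q_i.
p = [[q[i][j] / q_total[i] for j in range(n)] for i in range(n)]

r = [1.0 / n] * n  # start from uniform revenue shares
for _ in range(200):
    # Left multiplication r <- r P, as in the PageRank power method.
    r = [sum(r[i] * p[i][j] for i in range(n)) for j in range(n)]

# Revenue share of sector j is proportional to pi_j * q_j, so divide by q_j.
prices = [r[j] / q_total[j] for j in range(n)]
scale = 3.0 / prices[2]  # normalize so the family price is 3, as in Table 2
prices = [x * scale for x in prices]
print(r, prices)
```

The iteration converges to revenue shares proportional to 600 : 750 : 900, and the rescaled prices agree with the 20, 15, and 3 listed in Table 2.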

It might seem disputable to juxtapose PageRank and Leontief methods. To be sure, the original motivation of Leontief's work was to give a formal method to find equilibrium prices for the reproducibility of the economy and to use the method to estimate the impact on the entire economy of a change in demand in any sector of the economy. Leontief, to the best of our limited knowledge, was not motivated by an industry ranking problem. On the other hand, the motivation underlying the other methods described in this paper is the ranking of a set of homogeneous entities. Despite the original motivations, however, there are more than coincidental similarities between the Leontief open and closed models and the other ranking methods described in this paper.

These connections motivated the discussion of the Leontief contribution, which is probably the least known among the surveyed methods within the computing community.

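→ Placing PageRank and the Leontief methods side by side may seem questionable.

(→ because Leontief's model is not about ranking)

To be sure, the original motivation of Leontief's work was to provide a formal method for finding equilibrium prices that make the economy reproducible, and to use that method to estimate how a change in demand in any sector affects the entire economy.

Leontief, to the best of the authors' limited knowledge, was not motivated by an industry ranking problem.

By contrast, the motivation behind the other methods described in the paper is to rank a set of homogeneous entities.

Despite the different original motivations, there are more than coincidental similarities between the open and closed Leontief models and the other ranking methods described in the paper.

These connections motivated the discussion of Leontief's contribution, which is probably the least known of the surveyed methods within the computing community.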

💡

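This passage discusses the connection between the Leontief model and PageRank and the other ranking methods. Leontief's original goal was to set equilibrium prices for each sector of the economy and to assess the effect of changes in demand on the economy as a whole. PageRank and the other methods, by contrast, focus on ranking homogeneous entities. The paper nevertheless explores the similarities between the Leontief model and the ranking methods, and the comparison is a notable example of the interplay between economics and computing.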

Conclusion

The classic notion of quality of information is related to the judgment given by few field experts. PageRank introduced an original notion of quality of information found on the Web: the collective intelligence of the Web, formed by the opinions of the millions of people that populate this universe, is exploited to determine the importance, and ultimately the quality, of that information.

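→ The classic notion of information quality is tied to the judgment of a few field experts.

PageRank introduced an original notion of the quality of information found on the Web: the collective intelligence formed by the opinions of the millions of people who populate the Web is used to determine the importance, and ultimately the quality, of that information.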

Consider the difference between expert evaluation and collective evaluation. The former tends to be intrinsic, subjective, deep, slow and expensive. By contrast, the latter is typically extrinsic, democratic, superficial, fast and low-cost. Interestingly, the dichotomy between these two evaluation methodologies is not peculiar to information found on the Web. In the context of assessment of academic research, peer review – the evaluation of scholarly publications given by peer experts working in the same field of the publication – plays the role of expert evaluation. Collective evaluation consists in gauging the importance of a contribution through the bibliometric practice of counting and analysing citations received by the publication from the academic community. Citations generally witness the use of information and acknowledge intellectual debt. Eigenfactor, a PageRank-inspired bibliometric indicator, is among the most interesting recent proposals to collectively evaluate the status of academic journals. The consequences of a shift from peer review to bibliometric evaluation are currently heartily debated in the academic community.

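→ Consider the difference between expert evaluation and collective evaluation.

Expert evaluation tends to be intrinsic, subjective, deep, slow, and expensive.

Collective evaluation, by contrast, is typically extrinsic, democratic, superficial, fast, and cheap.

Interestingly, this dichotomy is not limited to information on the Web.

In the assessment of academic research, peer review (the evaluation of scholarly publications by experts working in the same field) plays the role of expert evaluation.

Collective evaluation gauges the importance of a contribution through the bibliometric practice of counting and analyzing the citations it receives from the academic community.

Citations generally witness the use of information and acknowledge intellectual debt.

Eigenfactor, a PageRank-inspired bibliometric indicator, is among the most interesting recent proposals for collectively evaluating the status of academic journals.

The consequences of a shift from peer review to bibliometric evaluation are currently the subject of heated debate in the academic community.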

💡

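This passage explains the difference between expert evaluation and collective evaluation and what it implies. Expert evaluation provides in-depth analysis but is slow and expensive. Collective evaluation is faster and more cost-effective but can be superficial. In research assessment, peer review supplies the in-depth expert analysis, while bibliometric methods such as citation counts assess the impact and visibility of research from a broader perspective. Modern bibliometric indicators such as Eigenfactor offer new ways to assess the importance of academic journals, which has triggered debate in the academic community about the respective roles of expert and collective evaluation.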

Bibliometrics

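Timeline

1963: Impact Factor (IF), proposed by Eugene Garfield; still in use and the most widely known indicator.
Computation: the average number of citations received by the papers a journal published in the previous two years.
Purpose: assesses the influence and importance of a journal.
Problem: the importance of the citing journal is ignored, so every citing journal carries the same weight.

1976: Pinski and Narin.
Thesis: a journal is influential if it is cited by influential journals (citations from well-known and obscure journals are weighted differently).
Computation: the same circular thesis as PageRank.
Nodes are journals; weighted edges are citation relations.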

Sociometry

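Timeline

1949: proposed by John R. Seeley, the first to apply in this field the circular argument later used by PageRank. A child's popularity is determined not simply by the number of children who choose them, but also by the popularity of the children who make that choice. → Similar to the way PageRank evaluates the importance of Web pages.

1953: model proposed by Leo Katz (directed graph).
Computation: longer paths receive less weight than shorter ones.
-> Similar to HITS: based on the adjacency matrix L, with 1 for an endorsement and 0 otherwise.
-> Similar to PageRank: uses an attenuation factor, like the Google matrix G, although it is not a convex combination.
1) With a large attenuation factor, contributions that arrive over long paths (through several nodes) still receive high weight, so the Katz score is high.
2) With a small attenuation factor, long paths almost vanish and only contributions over short paths (from immediate predecessors) receive substantial weight.
Nodes are people; weighted edges are endorsements -> who endorses (direct effect) + how many people endorse (indirect effect).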
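The effect of the attenuation factor can be illustrated with a small sketch. It assumes the path-counting reading of the Katz score, in which a path of length k into a node contributes a^k. The six-person endorsement graph and the function name `katz_scores` are hypothetical. The graph is acyclic, so the sum is finite for any attenuation factor; on graphs with cycles the factor must be small enough for the series to converge.

```python
def katz_scores(L, a, max_len=50):
    """Sum, for k = 1..max_len, of a**k times the number of length-k paths into each node."""
    n = len(L)
    scores = [0.0] * n
    weighted = [1.0] * n  # weight of the empty path starting at every node
    for _ in range(max_len):
        # Extend every path by one edge and attenuate it by a.
        weighted = [a * sum(weighted[i] * L[i][j] for i in range(n))
                    for j in range(n)]
        scores = [s + w for s, w in zip(scores, weighted)]
    return scores


# Hypothetical endorsements: L[i][j] = 1 if person i endorses person j.
# Person 4 has two direct endorsers; person 5 has one, backed by a chain.
edges = [(0, 4), (1, 4), (0, 2), (1, 2), (2, 3), (3, 5)]
L = [[0] * 6 for _ in range(6)]
for i, j in edges:
    L[i][j] = 1

high = katz_scores(L, a=0.9)
low = katz_scores(L, a=0.2)
print(high[4], high[5])  # long paths count: person 5 scores higher
print(low[4], low[5])    # long paths fade: person 4 scores higher
```

With a = 0.9 person 5 collects a + a² + 2a³ ≈ 3.17 against 2a = 1.8 for person 4. With a = 0.2 the order flips (about 0.26 against 0.4), which matches points 1) and 2) above.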

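1965: Charles H. Hubbell generalized Katz's proposal to the members of a social group. As in Pinski and Narin's model, each person has their own importance (influence, prestige), so the quality of an endorsement depends on who gives it. Social prestige is not determined by the number of endorsements alone; the quality of the endorsement and the prestige of the endorser also matter.
-> Weight matrix W ($w_{i,j}$: the strength with which i endorses j); the weights are arbitrary and may be negative, so negative endorsements exist.
-> v: exogenous input (a baseline amount of status); the minimum status used here is 0.2.
Nodes are people with their prestige; edges are endorsement strengths (+ : positive endorsement, - : negative endorsement).
Computation:
A with negative prestige --> positive endorsement --> B ==> negative contribution to B's prestige
A with negative prestige --> negative endorsement --> B ==> positive contribution to B's prestige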
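The two sign rules can be reproduced with a small sketch. It assumes the row-vector form r = v + rW of Hubbell's model, with W[i][j] the strength with which i endorses j and a baseline status of 0.2 for everyone. The four-member group, the weights, and the function name `hubbell_status` are hypothetical.

```python
def hubbell_status(W, v, iterations=100):
    """Iterate r <- v + r W; converges when the spectral radius of W is below 1."""
    n = len(v)
    r = list(v)
    for _ in range(iterations):
        r = [v[j] + sum(r[i] * W[i][j] for i in range(n)) for j in range(n)]
    return r


members = ["C", "A", "B", "D"]  # hypothetical group
W = [
    # C     A     B     D
    [0.0, -3.0, 0.0, 0.0],   # C strongly rejects A
    [0.0, 0.0, 0.5, -0.5],   # A endorses B and rejects D
    [0.0, 0.0, 0.0, 0.0],    # B endorses nobody
    [0.0, 0.0, 0.0, 0.0],    # D endorses nobody
]
v = [0.2] * 4  # baseline status for every member

status = dict(zip(members, hubbell_status(W, v)))
print(status)
```

C's rejection pushes A's status to about -0.4. A's positive endorsement then lowers B from the baseline 0.2 to about 0, while A's negative endorsement raises D to about 0.4.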

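Also applied to sports ranking.
Basic idea: a team is strong if it has beaten strong teams.
Match results between teams are stored in a matrix from which each team's strength is computed → the entry depends on the result: win: 1, tie: 0.5, loss: 0.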

Econometrics

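Timeline

1941: Leontief (Nobel laureate): input-output model.
Idea: the economy can be divided into any desired number of sectors, e.g. manufacturing, fishing.
Inputs (columns): what a sector consumes to produce its product. Outputs (rows): what a sector sells.
-> Goal: find the price of a unit of product for each industry.
-> The prices must guarantee the reproducibility of the economy, that is, each industry balances the cost of its inputs with the revenue from its outputs.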