How do I find similar results and sort by similarity?

Nice programing

nicepro 2020. 11. 11. 20:40

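How can I query for records ordered by similarity?

For example, searching for "Stock Overflow" would return:

Stack Overflow
SharePoint Overflow
Math Overflow
Politic Overflow
VFX Overflow

For example, searching for "LO" would return:

Pablo Picasso
Michelangelo
Jackson Pollock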

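What I need help with:

Using a search engine to index and search a MySQL table, for better results
- Using the Sphinx search engine with PHP
- Using the Lucene engine with PHP
Using full-text indexing to find similar or containing strings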

What does not work well

Levenshtein distance is very erratic. (UDF, Query)
Searching for "dog" gives me:
1. dog
2. bog
3. ago
4. big
5. echo
LIKE returns better results, but returns nothing for long queries even though similar strings exist.
1. dog
2. dogid
3. dogaral
4. dogma

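I have found that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method sometimes does not return the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than English, so it is quite limited. You could get away with LIKE, but it is really meant for basic searches. You may want to look into other search methods for what you want to achieve. For example:

You may use Lucene as the search base for your projects. It is implemented in most major programming languages and it is quite fast and versatile. This method is probably the best, as it searches not only for substrings but also for letter transpositions, prefixes and suffixes (all combined). However, you need to keep a separate index (using cron to update it from an independent script once in a while works, though).

Or, if you want a MySQL solution, the full-text functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM (InnoDB only gained FULLTEXT indexes in MySQL 5.6), you can create a temporary table, then perform your full-text search: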

CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(2000) CHARACTER SET latin1 NOT NULL,
  `description` text CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

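Use a data generator to generate some random data if you do not want to create it yourself.

** NOTE **: the column collation should be latin1_bin to perform a case-sensitive search, instead of the case-insensitive search you get with the default latin1 collation. For Unicode strings, I would recommend utf8_bin for case-sensitive and utf8_general_ci for case-insensitive searches.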

DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
   SELECT * FROM `tests`.`data_table`;

ALTER TABLE `tests`.`data_table_temp`  ENGINE = MYISAM;

ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
  `title` ,
  `description`
);

SELECT *,
       MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
  FROM `tests`.`data_table_temp`
 WHERE MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
 ORDER BY `score` DESC;

DROP TABLE `tests`.`data_table_temp`;

Read more about it on the MySQL API reference page.

The downside to this is that it will not look for letter transpositions or "similar, sounds like" words.
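To make the boolean-mode expression '+so* +nullam lorem' from the query above more concrete, here is a minimal Python sketch of how such an expression can be read: a leading + marks a required term, a trailing * marks a prefix match, and a bare term is optional. The function names are made up for this illustration, and the sketch deliberately ignores MySQL's stopwords, minimum word length and relevance scoring, so it is only an approximation of the matching rule, not of MySQL's implementation. The last call also shows the downside mentioned above: a transposed word does not match.

```python
import re


def parse_boolean_query(query):
    """Split a query such as '+so* +nullam lorem' into required and optional terms."""
    required, optional = [], []
    for token in query.lower().split():
        target = required if token.startswith("+") else optional
        target.append(token.lstrip("+"))
    return required, optional


def term_matches(term, words):
    """A trailing '*' means prefix match; otherwise the whole word must match."""
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def boolean_match(query, text):
    """Return (matched, optional_hits) for this simplified boolean-mode check."""
    words = set(re.findall(r"\w+", text.lower()))
    required, optional = parse_boolean_query(query)
    if not all(term_matches(term, words) for term in required):
        return False, 0
    return True, sum(term_matches(term, words) for term in optional)


query = "+so* +nullam lorem"
print(boolean_match(query, "Nullam sodales lorem ipsum"))  # both required terms plus the optional one
print(boolean_match(query, "Nullam sodales ipsum"))        # required terms only
print(boolean_match(query, "Nullma sodales lorem"))        # transposed letters: required term missing
```

The count of optional hits stands in for the idea that optional terms raise a row's rank; the real score computed by MATCH ... AGAINST is more involved.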

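** UPDATE **

Using Lucene for your search, you simply need to create a cron job (all web hosts have this "feature") that executes a PHP script (e.g. "cd /path/to/script; php searchindexer.php") to update the indexes. The reason is that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this ensures that all searches are performed as fast as possible. Therefore, you may want to create a delayed job to be run by the server. It may run overnight or within the next hour; this is up to you. The PHP script should look something like this: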

$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  // change this option for your need
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$rowSet = getDataRowSet();  // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
   ;
   $indexer->addDocument($doc);
}

// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this

$indexer->optimize();  // do this every time you add more data to your indexer...
$indexer->commit();    // finalize the process

Then, this is basically how you search (basic search):

$index = Zend_Search_Lucene::open('/path/to/lucene/data');

// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
   new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$query = 'php +field1:foo';  // search for the word 'php' in any field,
                                 // +search for 'foo' in field 'field1'

$hits = $index->find($query);

$numHits = count($hits);
foreach ($hits as $hit) {
   $score = $hit->score;  // the hit weight
   $field1 = $hit->field1;
   // etc.
}

Here are great sites about Lucene in Java, PHP, and .Net.

In conclusion, each search method has its own pros and cons:

You mentioned Sphinx search and it looks very good, as long as you can make the daemon run on your web host.
Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away in user searches.
MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.
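The sync caveat of an index rebuilt by a scheduled job can be sketched with a toy inverted index in Python. This is not how Lucene stores its data; the functions and the rows dictionary are hypothetical stand-ins for the indexer script and the database table, and the point is only that searches see whatever the last indexing run saw.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(rows):
    """Build an inverted index: word -> set of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in tokenize(text):
            index[word].add(row_id)
    return index


def search(index, query):
    """Return the ids of rows that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical "database table": id -> title.
rows = {1: "Stack Overflow", 2: "Math Overflow"}
index = build_index(rows)          # what the scheduled job would do

rows[3] = "VFX Overflow"           # new row added after the last indexing run
print(search(index, "overflow"))   # stale index: row 3 is missing

index = build_index(rows)          # next scheduled run
print(search(index, "overflow"))   # row 3 is now found
```

Lookups only touch the index, which is why they stay fast, and also why rows added or deleted since the last run are invisible until the index is rebuilt.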

Please feel free to comment if I have forgotten/missed anything.

1. Similarity

For Levenshtein in MySQL I found this, from ~~www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function~~

SELECT
    column_name,
    LEVENSHTEIN(column_name, 'search_string') AS distance
FROM table_name
WHERE
    LEVENSHTEIN(column_name, 'search_string') < distance_limit
ORDER BY distance ASC -- smallest distance (most similar) first
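For readers who want to see what such a LEVENSHTEIN function computes, here is a small Python sketch of the classic dynamic-programming edit distance, plus a helper that mirrors the query: keep candidates below a distance limit and list the closest first. The helper name rank_by_distance and the sample words are assumptions made for illustration; this is not the code of the stored function referenced above.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_by_distance(candidates, search_string, distance_limit):
    """Keep candidates whose distance is below the limit, closest first."""
    scored = [(levenshtein(c.lower(), search_string.lower()), c) for c in candidates]
    return sorted((d, c) for d, c in scored if d < distance_limit)


print(rank_by_distance(["dogma", "dogaral", "dog", "dogid"], "dog", 3))
# [(0, 'dog'), (2, 'dogid'), (2, 'dogma')]
```

Because every extra character costs one edit, longer strings that merely contain the search term are pushed down or cut off by the limit, which matches the observation that this measure suits whole-string comparison better than keyword lookup.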

2. Containing, case insensitive

Use the LIKE operator of MySQL, which is case insensitive with the default collation. The % is a wildcard, so there may be any string before and after search_string.

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%"

3. Containing, case sensitive

The MySQL Manual helps:

The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation...

My MySQL setup does not support latin1_general_cs or latin1_bin, but it worked fine for me to use the collation utf8_bin as binary utf8 is case sensitive:

SELECT 
    *
FROM 
    table
WHERE 
    column_name LIKE "%search_string%" COLLATE utf8_bin

2. / 3. sorted by Levenshtein Distance

SELECT
    column_name,
    LEVENSHTEIN(column_name, 'search_string') AS distance -- for sorting
FROM table_name
WHERE
    column_name LIKE "%search_string%"
    COLLATE utf8_bin -- for case sensitivity, just leave out for CI
ORDER BY
    distance ASC -- most similar first
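The combined approach (filter with LIKE, then sort by edit distance) can be tried without a MySQL server. The sketch below swaps in SQLite through Python's sqlite3 module and registers a Python function under the name LEVENSHTEIN, so it illustrates the query shape only; the table sites and its rows are made up, and SQLite's LIKE is case insensitive for ASCII by default, which corresponds to the case-insensitive variant.

```python
import sqlite3


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as LEVENSHTEIN(x, y).
conn.create_function("LEVENSHTEIN", 2, levenshtein)
conn.execute("CREATE TABLE sites (name TEXT)")
conn.executemany(
    "INSERT INTO sites (name) VALUES (?)",
    [("Stack Overflow",), ("Math Overflow",), ("VFX Overflow",), ("Server Fault",)],
)

rows = conn.execute(
    """
    SELECT name, LEVENSHTEIN(LOWER(name), LOWER(:q)) AS distance
    FROM sites
    WHERE name LIKE '%' || :q || '%'
    ORDER BY distance ASC
    """,
    {"q": "overflow"},
).fetchall()
print(rows)
# [('VFX Overflow', 4), ('Math Overflow', 5), ('Stack Overflow', 6)]
```

Sorting ascending puts the shortest containing string first, since every character outside the matched substring adds one edit.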

It seems that your definition of similarity is semantic similarity, so to build such a similarity function you should use semantic similarity measures. Note that the scope of work on this issue can vary from a few hours to years, so it is advisable to decide on the scope before starting. I could not figure out which data you have for building the similarity relation; I assume that you have access to a dataset of documents and a dataset of queries.

You can start with co-occurrence of the words (e.g., conditional probability). You will quickly discover that the stop words come out as related to most words, simply because they are very popular. Using the lift of the conditional probability will take care of the stop words, but will make the relation prone to error when counts are small (most of your cases). You might try Jaccard, but since it is symmetric there will be many relations it won't find. Then you might consider relations that appear only within a short distance of the base word. You can (and should) consider relations based on general corpora (e.g., Wikipedia) and on user-specific ones (e.g., the user's emails).
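The three co-occurrence measures named above can be compared on a toy corpus. The Python sketch below uses made-up documents, each reduced to its set of distinct words, and counts co-occurrence at document level; real data, tokenization and windowing would differ.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): each document is its set of distinct words.
documents = [
    {"the", "stack", "overflow"},
    {"the", "stack", "overflow", "error"},
    {"the", "stack", "trace"},
    {"the", "math", "proof"},
    {"the", "buffer", "size"},
]

n_docs = len(documents)
word_count = Counter(word for doc in documents for word in doc)
pair_count = Counter(
    frozenset(pair) for doc in documents for pair in combinations(sorted(doc), 2)
)


def cooccurrence(a, b):
    """Number of documents that contain both words."""
    return pair_count[frozenset((a, b))]


def conditional_probability(a, b):
    """P(b | a): share of documents containing a that also contain b."""
    return cooccurrence(a, b) / word_count[a]


def lift(a, b):
    """P(a, b) / (P(a) * P(b)); values near 1 suggest independence."""
    return (cooccurrence(a, b) * n_docs) / (word_count[a] * word_count[b])


def jaccard(a, b):
    """Documents with both words / documents with either; symmetric."""
    both = cooccurrence(a, b)
    return both / (word_count[a] + word_count[b] - both)


for other in ("the", "overflow"):
    print(other,
          conditional_probability("stack", other),
          lift("stack", other),
          jaccard("stack", other))
print("math/proof lift:", lift("math", "proof"))
```

On this corpus the stop word "the" gets the highest conditional probability given "stack", while lift ranks "overflow" above it. The pair "math"/"proof" receives a large lift from a single shared document, which is the small-count weakness mentioned above.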

Very shortly you will have plenty of similarity measures, each of which is good and has some advantage over the others.

In order to combine such measures, I like to reduce the problem into a classification problem.

You should build a dataset of pairs of words and label them as "is related". In order to build a large labeled dataset you can:

Use sources of known related words (e.g., good old Wikipedia categories) for positives.
Treat pairs of words not known to be related as negatives, since most of them are not related.

Then use all the measures you have as features of the pairs. Now you are in the domain of supervised classification. Build a classifier on the dataset, evaluate it according to your needs, and you get a similarity measure that fits your needs.
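As a rough sketch of this last step, the Python code below fits a logistic regression with plain gradient descent on a handful of word pairs. Everything in it is hypothetical: the two features stand for any two of your similarity measures, the numbers and labels are invented, and a real setup would use far more pairs, a held-out evaluation set and probably an existing machine learning library.

```python
import math

# Hypothetical training data: the measures computed for one word pair
# (here: conditional probability, Jaccard index) and an "is related" label.
training = [
    ((0.90, 0.60), 1),
    ((0.80, 0.70), 1),
    ((0.70, 0.50), 1),
    ((0.20, 0.10), 0),
    ((0.10, 0.05), 0),
    ((0.30, 0.10), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Estimated probability that the pair is related."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(data, learning_rate=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for features, label in data:
            error = predict(weights, bias, features) - label
            grad_b += error
            for k, x in enumerate(features):
                grad_w[k] += error * x
        bias -= learning_rate * grad_b / len(data)
        weights = [w - learning_rate * g / len(data)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = train(training)
print(predict(weights, bias, (0.85, 0.65)))  # pair with high measures
print(predict(weights, bias, (0.15, 0.05)))  # pair with low measures
```

The predicted probability then serves as the combined similarity score, and the learned weights show how much each individual measure contributes.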

Reference URL: https://stackoverflow.com/questions/3338889/how-to-find-similar-results-and-sort-by-similarity
